BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates to development of novel binding 
mini-proteins, and especially micro-proteins, by an 
iterative process of mutagenesis, expression, affinity 
selection, and amplification. In this process, a gene 
encoding a mini-protein potential binding domain, said gene 
being obtained by random mutagenesis of a limited number of 
predetermined codons, is fused to a genetic element which 
causes the resulting chimeric expression product to be 
displayed on the outer surface of a virus (especially a 
filamentous phage) or a cell. Affinity selection is then 
used to identify viruses or cells whose genome includes 
such a fused gene which coded for the protein which bound to 
the chromatographic target. 

Description of the Related Art

The amino acid sequence of a protein determines its 
three-dimensional (3D) structure, which in turn determines 
protein function. Some residues on the polypeptide chain 
are more important than others in determining the 3D 
structure of a protein, and hence its ability to bind,
noncovalently but very tightly and specifically, to
characteristic target molecules.

"Protein engineering" is the art of manipulating the
sequence of a protein in order, e.g., to alter its binding
characteristics. The factors affecting protein binding are
known, but designing new complementary surfaces has proved
difficult. Quiocho et al. (QUIO87) suggest it is unlikely
that, using current protein engineering methods, proteins
can be constructed with binding properties superior to those
of proteins that occur naturally.

Nonetheless, there have been some isolated successes.
For example, Wilkinson et al. (WILK84) reported that a
mutant of the tyrosyl tRNA synthetase of Bacillus
stearothermophilus with the mutation Thr51 --> Pro exhibits a
100-fold increase in affinity for ATP.

With the development of recombinant DNA techniques, it 
became possible to obtain a mutant protein by mutating the 
gene encoding the native protein and then expressing the
mutated gene. Several mutagenesis strategies are known. One,
"protein surgery" (DILL87), involves the introduction of one
or more predetermined mutations within the gene of choice. 
A single polypeptide of completely predetermined sequence is 
expressed, and its binding characteristics are evaluated. 

At the other extreme is random mutagenesis by means of
relatively nonspecific mutagens such as radiation and
various chemical agents. See Ho et al. (HOCJ85) and
Lehtovaara, EP Appln. 285,123.

It is possible to randomly vary predetermined nucleotides
using a mixture of bases in the appropriate cycles of
a nucleic acid synthesis procedure (OLIP86, OLIP87). The
proportion of bases in the mixture, for each position of a
codon, will determine the frequency at which each amino acid
will occur in the polypeptides expressed from the degenerate
DNA population (REID88a; VERS86a; VERS86b). The problem
of unequal abundance of DNA encoding different amino acids
is not discussed in these references.

Ferenci and collaborators have published a series of
papers on the chromatographic isolation of mutants of the
maltose-transport protein LamB of E. coli (FERE82a, FERE82b,
FERE83, FERE84, CLUN84, HEIN87 and papers cited therein).
The mutants were either spontaneous or induced with
nonspecific chemical mutagens. Levels of mutagenesis were picked
to provide single point mutations or single insertions of
two residues. No multiple mutations were sought or found.

While variation was seen in the degree of affinity for
the conventional LamB substrates maltose and starch, there
was no selection for affinity to a target molecule not bound
at all by native LamB, and no multiple mutations were sought
or found. FERE84 speculated that the affinity chromatographic
selection technique could be adapted to development
of similar mutants of other "important bacterial surface-
located enzymes", and to selecting for mutations which
result in the relocation of an intracellular bacterial
protein to the cell surface. Ferenci's mutant surface
proteins would not, however, have been chimeras of a
bacterial surface protein and an exogenous or heterologous
binding domain.

Ferenci also taught that there was no need to clone the
structural gene, or to know the protein structure, active
site, or sequence. The method of the present invention,
however, specifically utilizes a cloned structural gene. It
is not possible to construct and express a chimeric, outer
surface-directed potential binding protein-encoding gene
without cloning.

Ferenci did not limit the mutations to particular loci.
Substitutions were limited by the nature of the mutagen
rather than by the desirability of particular amino acid
types at a particular site. In the present invention,
knowledge of the protein structure, active site and/or
sequence is used as appropriate to predict which residues
are most likely to affect binding activity without unduly
destabilizing the protein, and the mutagenesis is focused
upon those sites. Ferenci does not suggest that surface
residues should be preferentially varied. In consequence,
Ferenci's selection system is much less efficient than that
disclosed herein.

A number of researchers have directed unmutagenized foreign
antigenic epitopes to the surface of bacteria or phage,
fused to a native bacterial or phage surface protein, and
demonstrated that the epitopes were recognized by
antibodies. Thus, Charbit, et al. (CHAR86a,b) genetically inserted
the C3 epitope of the VP1 coat protein of poliovirus into
the LamB outer membrane protein of E. coli, and determined
immunologically that the C3 epitope was exposed on the
bacterial cell surface. Charbit, et al. (CHAR87) likewise
produced chimeras of LamB and the A (or B) epitopes of the
preS2 region of hepatitis B virus.

A chimeric LacZ/OmpB protein has been expressed in E.
coli and is, depending on the fusion, directed to either the
outer membrane or the periplasm (SILH77). A chimeric
LacZ/OmpA surface protein has also been expressed and
displayed on the surface of E. coli cells (WEIN83). Others
have expressed and displayed on the surface of a cell
chimeras of other bacterial surface proteins, such as E.
coli type 1 fimbriae (HEDE89) and Bacteroides nodosus type
1 fimbriae (JENN89). In none of the recited cases was the
inserted genetic material mutagenized.

Dulbecco (DULB86) suggests a procedure for
incorporating a foreign antigenic epitope into a viral surface
protein so that the expressed chimeric protein is displayed
on the surface of the virus in a manner such that the
foreign epitope is accessible to antibody. In 1985 Smith
(SMIT85) reported inserting a nonfunctional segment of the
EcoRI endonuclease gene into gene III of bacteriophage f1,
"in phase". The gene III protein is a minor coat protein
necessary for infectivity. Smith demonstrated that the
recombinant phage were adsorbed by immobilized antibody
raised against the EcoRI endonuclease, and could be eluted
with acid. De la Cruz et al. (DELA88) have expressed a
fragment of the repeat region of the circumsporozoite
protein from Plasmodium falciparum on the surface of M13 as
an insert in the gene III protein. They showed that the
recombinant phage were both antigenic and immunogenic in
rabbits, and that such recombinant phage could be used for
B epitope mapping. The researchers suggest that similar
recombinant phage could be used for T epitope mapping and
for vaccine development.

None of these researchers suggested mutagenesis of the 
inserted material, nor is the inserted material a complete 
binding domain conferring on the chimeric protein the 
ability to bind specifically to a receptor other than the 
antigen combining site of an antibody. 

McCafferty et al. (MCCA90) expressed a fusion of an Fv
fragment of an antibody to the N-terminal of the pIII
protein. The Fv fragment was not mutated.

Parmley and Smith (PARM88) suggested that an epitope 
library that exhibits all possible hexapeptides could be 
constructed and used to isolate epitopes that bind to 
antibodies. In discussing the epitope library, the authors 
did not suggest that it was desirable to balance the 
representation of different amino acids. Nor did they teach 
that the insert should encode a complete domain of the 
exogenous protein. Epitopes are considered to be unstruc- 
tured peptides as opposed to structured proteins. 

Scott and Smith (SCOT90) and Cwirla et al. (CWIR90)
prepared "epitope libraries" in which potential hexapeptide
epitopes for a target antibody were randomly mutated by
fusing degenerate oligonucleotides, encoding the epitopes,
with gene III of fd phage, and expressing the fused gene in
phage-infected cells. The cells manufactured fusion phage
which displayed the epitopes on their surface; the phage
which bound to immobilized antibody were eluted with acid
and studied. In both cases, the fused gene featured a
segment encoding a spacer region to separate the variable
region from the wild type pIII sequence so that the varied
amino acids would not be constrained by the nearby pIII
sequence. Devlin et al. (DEVL90) similarly screened, using
M13 phage, for random 15 residue epitopes recognized by
streptavidin. Again, a spacer was used to move the random
peptides away from the rest of the chimeric phage protein.
These references therefore taught away from constraining the
conformational repertoire of the mutated residues.

Another problem with the Scott and Smith, Cwirla et
al., and Devlin et al. libraries was that they provided a
highly biased sampling of the possible amino acids at each
position. Their primary concern in designing the degenerate
oligonucleotide encoding their variable region was to ensure
that all twenty amino acids were encodible at each position;
a secondary consideration was minimizing the frequency of
occurrence of stop signals. Consequently, Scott and Smith
and Cwirla et al. employed NNK (N=equal mixture of G, A, T,
C; K=equal mixture of G and T) while Devlin et al. used NNS
(S=equal mixture of G and C). There was no attempt to
minimize the frequency ratio of most favored-to-least
favored amino acid, or to equalize the rate of occurrence of
acidic and basic amino acids.
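
The bias described above can be checked by enumerating the codons that a degenerate scheme allows. The following is a minimal sketch, assuming the standard genetic code and equimolar base mixtures; the function names and the choice of D/E as acidic and K/R as basic residues are illustrative assumptions, not taken from the cited references.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Equimolar mixtures named in the text: N = G/A/T/C, K = G/T, S = G/C.
MIXES = {"N": "GATC", "K": "GT", "S": "GC"}


def residue_counts(scheme):
    """Count how many allowed codons encode each residue for a scheme such as 'NNK'."""
    pools = [MIXES.get(ch, ch) for ch in scheme]
    return Counter(CODE["".join(c)] for c in product(*pools))


def bias_summary(scheme):
    """Summarize stop frequency and amino acid bias for a degenerate codon."""
    counts = residue_counts(scheme)
    total = sum(counts.values())
    amino = {aa: n for aa, n in counts.items() if aa != "*"}
    return {
        "codons": total,
        "amino_acids": len(amino),
        "stop_fraction": counts.get("*", 0) / total,
        "most_to_least": max(amino.values()) / min(amino.values()),
        "acidic": counts["D"] + counts["E"],  # assumption: acidic = Asp + Glu
        "basic": counts["K"] + counts["R"],   # assumption: basic = Lys + Arg
    }


for scheme in ("NNK", "NNS"):
    print(scheme, bias_summary(scheme))
```

Under these assumptions both schemes encode all twenty amino acids with a single stop codon among 32, but the most favored residues (Leu, Ser, Arg) are encoded three times as often as the least favored, and basic codons outnumber acidic ones.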

Devlin et al. characterized several affinity-selected
streptavidin-binding peptides, but did not measure the
affinity constants for these peptides. Cwirla et al. did
determine the affinity constants for their peptides, but were
disappointed to find that their best hexapeptides had
affinities (350-300nM) "orders of magnitude" weaker than that of
the native Met-enkephalin epitope (7nM) recognized by the
target antibody. Cwirla et al. speculated that phage
bearing peptides with higher affinities remained bound under
acidic elution, possibly because of multivalent interactions
between phage (carrying about 4 copies of pIII) and the
divalent target IgG. Scott and Smith were able to find
peptides whose affinity for the target antibody (A2) was
comparable to that of the reference myohemerythrin epitope
(50nM). However, Scott and Smith likewise expressed concern
that some high-affinity peptides were lost, possibly through
irreversible binding of fusion phage to target.

Lam, et al. (LAM91) created a pentapeptide library by
nonbiological synthesis on solid supports. While they teach
that it is desirable to obtain the universe of possible
random pentapeptides in roughly equimolar proportions, they
deliberately excluded cysteine, to eliminate any possibility
of disulfide crosslinking.

Ladner, Glick, and Bird, WO88/06630 (publ. 7 Sept. 1988
and having priority from US application 07/021,046, assigned
to Genex Corp.) (LGB) speculate that diverse single chain
antibody domains (SCAD) may be screened for binding to a
particular antigen by varying the DNA encoding the combining
determining regions of a single chain antibody, subcloning
the SCAD gene into the gpV gene of phage λ so that a
SCAD/gpV chimera is displayed on the outer surface of phage
λ, and selecting phage which bind to the antigen through
affinity chromatography. The only antigen mentioned is
bovine growth hormone. No other binding molecules, targets,
carrier organisms, or outer surface proteins are discussed.
Nor is there any mention of the method or degree of
mutagenesis. Furthermore, there is no teaching as to the
exact structure of the fusion nor of how to identify a
successful fusion or how to proceed if the SCAD is not
displayed.

Ladner and Bird, WO88/06601 (publ. 7 September 1988)
suggest that single chain "pseudodimeric" repressors (DNA-
binding proteins) may be prepared by mutating a putative
linker peptide followed by in vivo selection, and that mutation
and selection may be used to create a dictionary of
recognition elements for use in the design of asymmetric
repressors. The repressors are not displayed on the outer surface
of an organism.

Methods of identifying residues in protein which can be
replaced with a cysteine in order to promote the formation
of a protein-stabilizing disulfide bond are given in
Pantoliano and Ladner, U.S. Patent No. 4,903,773 (PANT90),
Pantoliano and Ladner (PANT87), Pabo and Suchanek (PABO86),
MATS89, and SAUE86.

Ladner, et al., WO90/02809 describes semirandom
mutagenesis ("variegation") of known proteins displayed as
domains of semiartificial outer surface proteins of
bacteria, phage or spores, and affinity selection of mutants
having desired binding characteristics. The smallest
proteins specifically mentioned in WO90/02809 are crambin
(3:40, 4:32, 16:26 disulfides; 46 AAs), the third domain of
ovomucoid (8:38, 16:35 and 24:56 disulfides; 56 AAs), and
BPTI (5:55, 14:38, 30:51 disulfides; 58 AAs). WO90/02809
also specifically describes a strategy for "variegating" a
codon to obtain a mix of all twenty amino acids at that
position in approximately equal proportions.

Bass, et al. (BASS90) fused human growth hormone to the
gene III protein of M13 phage. They suggested that hGH and
other "large proteins" might be mutated and "binding
selections" applied.

SUMMARY OF THE INVENTION

A polypeptide is a polymer composed of a single chain
of the same or different amino acids joined by peptide
bonds. Linear peptides can take up a very large number of
different conformations through internal rotations about the
main chain single bonds of each α carbon. These rotations
are hindered to varying degrees by side groups, with glycine
interfering the least, and valine, isoleucine and,
especially, proline, the most. A polypeptide of 20 residues may
have 10^20 different conformations which it may assume by
various internal rotations.

Proteins are polypeptides which, as a result of 
stabilizing interactions between amino acids that are not 
necessarily in adjacent positions in the chain, have folded
into a well-defined conformation. This folding is usually 
essential to their biological activity. 

For polypeptides of 40-60 residues or longer,
noncovalent forces such as hydrogen bonds, salt bridges, and
hydrophobic interactions are sufficient to stabilize a
particular folding or conformation. The polypeptide's
constituent segments are held to more or less that
conformation unless it is perturbed by a denaturant such as high
temperature, or low or high pH, whereupon the polypeptide
unfolds or "melts". The smaller the peptide, the more
likely it is that its conformation will be determined by the
environment. If a small unconstrained peptide has
biological activity, the peptide ligand will be in essence a random
coil until it comes into proximity with its receptor. The
receptor accepts the peptide only in one or a few
conformations because alternative conformations are disfavored by
unfavorable van der Waals and other non-covalent
interactions.

Small polypeptides have potential advantages over
larger polypeptides when used as therapeutic or diagnostic
agents, including (but not limited to):

a) better penetration into tissues,

b) faster elimination from the circulation (important for
imaging agents),

c) lower antigenicity, and

d) higher activity per mass.

Moreover, polypeptides, especially those of less than
about 40 residues, have the advantage of accessibility via
chemical synthesis; polypeptides of under about 30 residues
are particularly preferred. Thus, it would be desirable to
be able to employ the combination of mutation and affinity 
selection to identify small polypeptides which bind a target 
of choice. 

Most polypeptides of this size, however, have
disadvantages as binding molecules. According to Olivera et al.
(OLIV90a): "Peptides in this size range normally equilibrate
among many conformations (in order to have a fixed
conformation, proteins generally have to be much larger)."
Specific binding of a peptide to a target molecule requires
the peptide to take up one conformation that is
complementary to the binding site. For a decapeptide with
three isoenergetic conformations (e.g., β strand, α helix,
and reverse turn) at each residue, there are about 6 x 10^4
possible overall conformations (3^10 = 59,049). Assuming these
conformations to be equi-probable for the unconstrained decapeptide,
if only one of the possible conformations bound to the
binding site, then the affinity of the peptide for the
target would be expected to be about 6 x 10^4 higher if it
could be constrained to that single effective conformation.
Thus the unconstrained decapeptide, relative to a
decapeptide constrained to the correct conformation, would
be expected to exhibit lower affinity. It would also
exhibit lower specificity, since one of the other
conformations of the unconstrained decapeptide might be one which
bound tightly to a material other than the intended target.
By way of corollary, it could have less resistance to
degradation by proteases, since it would be more likely to
provide a binding site for the protease.
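
The decapeptide estimate is simple counting, and a few lines make the arithmetic explicit. This is an illustrative sketch only; the function name is hypothetical, and the model keeps the text's simplifying assumptions of a fixed number of equally probable states per residue and a single binding-competent conformation.

```python
def conformation_count(states_per_residue, residues):
    """Number of overall conformations when each residue independently adopts one state."""
    return states_per_residue ** residues


# Decapeptide with three isoenergetic states per residue, as in the text.
n = conformation_count(3, 10)

# If only one conformation binds, the unconstrained peptide spends 1/n of its
# time in it, so constraining the peptide would raise apparent affinity n-fold.
fraction_competent = 1 / n

print(n, fraction_competent)
```

The count is 59,049, which rounds to the figure of about 6 x 10^4 used in the argument.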

The present invention overcomes these problems, while
retaining the advantages of smaller polypeptides, by
identifying novel mini-proteins having the desired binding
characteristics. Mini-proteins are small polypeptides
which, while too small to have a stable conformation as a
result of noncovalent forces alone, are covalently
crosslinked (e.g., by disulfide bonds) into a stable
conformation and hence have biological activities more
typical of larger protein molecules than of unconstrained
polypeptides of comparable size. The mini-proteins with
which the present invention is particularly concerned fall
into two categories: (a) disulfide-bonded micro-proteins of
less than 40 amino acids; and (b) metal ion-coordinated
mini-proteins of less than 60 amino acids.

The present invention relates to the construction,
expression, and selection of mutated genes that specify
novel mini-proteins with desirable binding properties, as
well as these mini-proteins themselves, and the "libraries"
of mutant "genetic packages" used to display the
mini-proteins to a potential "target" material. The "targets"
may be, but need not be, proteins. Targets may include
other biological or synthetic macromolecules as well as
other organic and inorganic substances.

The prior application, WO90/02809, generally teaches
that stable protein domains may be mutated in order to
identify new proteins with desirable binding
characteristics. Among the suitable "parental" proteins
which it specifically identifies as useful for this purpose
are three proteins -- BPTI (58 residues), the third domain of
ovomucoid (56 residues), and crambin (46 residues) -- which
are in the size range of 40-60 residues wherein noncovalent
interactions between nonadjacent amino acids become
significant; all three also contain three disulfide bonds
that enhance the stability of the molecule.

Nowhere in WO90/02809 does one find any specific
recognition that a polypeptide with less than 40 residues,
and especially those with only one or two disulfide bonds,
would have sufficient stability to serve as a "scaffolding"
for mutational variation. These "micro-proteins" are,
nonetheless, of great utility, as previously indicated.

WO90/02809 also suggests the use of a protein, azurin,
having a different form of crosslink (Cu:CYS,HIS,HIS,MET).
However, azurin has 128 amino acids, so it cannot possibly
be considered a mini-protein. The present invention
relates to the use of mini-proteins of less than 60 amino
acids which feature a metal ion-coordinated crosslink.

By virtue of the present invention, proteins are
obtained which can bind specifically to targets other than
the antigen-combining sites of antibodies. A protein is not
to be considered a "binding protein" merely because it can
be bound by an antibody (see definition of "binding protein"
which follows). While almost any amino acid sequence of
more than about 6-8 amino acids is likely, when linked to an
immunogenic carrier, to elicit an immune response, any given
random polypeptide is unlikely to satisfy the stringent
definition of "binding protein" with respect to minimum
affinity and specificity for its substrate. It is only by
testing numerous random polypeptides simultaneously (and, in
the usual case, controlling the extent and character of the
sequence variation, i.e., limiting it to residues of a
potential binding domain having a stable structure, the
residues being chosen as more likely to affect binding than
stability) that this obstacle is overcome.

The appended claims are hereby incorporated by
reference into this specification as an enumeration of the
preferred embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows the main chain of scorpion toxin (Brookhaven
Protein Data Bank entry 1SN3) residues 20 through 42.
CYS25 and CYS41 are shown forming a disulfide. In the
native protein these groups form disulfides to other
cysteines, but no main-chain motion is required to
bring the gamma sulphurs into acceptable geometry.
Residues, other than GLY, are labeled at the β carbon
with the one-letter code.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

I. INTRODUCTION

The fundamental principle of the invention is one of
forced evolution. In nature, evolution results from the
combination of genetic variation, selection for advantageous
traits, and reproduction of the selected individuals,
thereby enriching the population for the trait. The present
invention achieves genetic variation through controlled
random mutagenesis ("variegation") of DNA, yielding a
mixture of DNA molecules encoding different but related
potential binding domains that are mutants of
micro-proteins. It selects for mutated genes that specify novel
proteins with desirable binding properties by 1) arranging
that the product of each mutated gene be displayed on the
outer surface of a replicable genetic package (GP) (a cell,
spore or virus) that contains the gene, and 2) using
affinity selection -- selection for binding to the target
material -- to enrich the population of packages for those
packages containing genes specifying proteins with improved
binding to that target material. Finally, enrichment is
achieved by allowing only the genetic packages which, by
virtue of the displayed protein, bound to the target, to
reproduce. The evolution is "forced" in that selection is
for the target material provided and in that particular
codons are mutagenized at higher-than-natural frequencies.

The display strategy is first perfected by modifying a
genetic package to display a stable, structured domain (the
"initial potential binding domain", IPBD) for which an
affinity molecule (which may be an antibody) is obtainable.
The success of the modifications is readily measured by,
e.g., determining whether the modified genetic package binds
to the affinity molecule.

The IPBD is chosen with a view to its tolerance for
extensive mutagenesis. Once it is known that the IPBD can
be displayed on a surface of a package and subjected to
affinity selection, the gene encoding the IPBD is subjected
to a special pattern of multiple mutagenesis, here termed
"variegation", which after appropriate cloning and
amplification steps leads to the production of a population of
genetic packages each of which displays a single potential
binding domain (a mutant of the IPBD), but which
collectively display a multitude of different though structurally
related potential binding domains (PBDs). Each genetic
package carries the version of the pbd gene that encodes the
PBD displayed on the surface of that particular package.
Affinity selection is then used to identify the genetic
packages bearing the PBDs with the desired binding
characteristics, and these genetic packages may then be amplified.
After one or more cycles of enrichment by affinity selection
and amplification, the DNA encoding the successful binding
domains (SBDs) may then be recovered from selected packages.
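
The effect of repeated cycles of affinity selection and amplification can be illustrated with a toy calculation. The sketch below is an assumption-laden model, not a description of any measured experiment: it supposes that each round retains a fixed fraction of binder-bearing packages and a smaller fixed fraction of all others, and that amplification multiplies every retained package equally. All parameter names and values are hypothetical.

```python
def enrich(binder_fraction, retain_binder, retain_background, rounds):
    """Return the binder fraction after each round of selection and amplification.

    binder_fraction   -- starting fraction of packages displaying a binding PBD
    retain_binder     -- fraction of binder packages recovered per round
    retain_background -- fraction of non-binder packages recovered per round
    """
    history = [binder_fraction]
    f = binder_fraction
    for _ in range(rounds):
        kept_binders = f * retain_binder
        kept_others = (1.0 - f) * retain_background
        # Uniform amplification rescales the population without changing fractions.
        f = kept_binders / (kept_binders + kept_others)
        history.append(f)
    return history


# Hypothetical numbers: one binder per million, 50% vs. 0.1% recovery per round.
for round_no, fraction in enumerate(enrich(1e-6, 0.5, 0.001, 3)):
    print(round_no, fraction)
```

With these illustrative values the enrichment per round is roughly the ratio of the two recovery fractions, which is why a few cycles can bring a rare binder to dominance in this simplified model.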

If need be, the DNA from the SBD-bearing packages may
then be further "variegated", using an SBD of the last round
of variegation as the "parental potential binding domain"
(PPBD) to the next generation of PBDs, and the process
continued until the worker in the art is satisfied with the
result. Because of the structural and evolutionary
relationship between the IPBD and the first generation of
PBDs, the IPBD is also considered a "parental potential
binding domain" (PPBD).

When micro-proteins are variegated, the residues which
are covalently crosslinked in the parental molecule are left
unchanged, thereby stabilizing the conformation. For
example, in the variegation of a disulfide bonded
micro-protein, certain cysteines are invariant so that under the
conditions of expression and display, covalent crosslinks
(e.g., disulfide bonds between one or more pairs of
cysteines) form, and substantially constrain the
conformation which may be adopted by the hypervariable linearly
intermediate amino acids. In other words, a constraining
scaffolding is engineered into polypeptides which are
otherwise extensively randomized.
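
The idea of a fixed crosslinking scaffold around randomized residues can be sketched at the amino acid level. The example below is hypothetical throughout: the parent sequence, the choice of variable positions, and the decision to exclude cysteine from the randomized positions (so that no additional disulfide partners arise) are assumptions made for illustration, and real variegation acts on codons rather than directly on residues.

```python
import random

# Assumption: variable positions draw from the 19 non-cysteine amino acids.
NON_CYS = "ADEFGHIKLMNPQRSTVWY"


def variegate(parent, variable_positions, rng):
    """Return one variant of `parent`, randomized only at `variable_positions`."""
    for i in variable_positions:
        if parent[i] == "C":
            raise ValueError("crosslinking cysteines must remain invariant")
    residues = list(parent)
    for i in variable_positions:
        residues[i] = rng.choice(NON_CYS)
    return "".join(residues)


PARENT = "ACGSTKLCA"          # hypothetical parent; Cys at indices 1 and 7
VARIABLE = [2, 3, 4, 5, 6]    # residues lying between the two cysteines

rng = random.Random(0)        # seeded so the sketch is repeatable
library = [variegate(PARENT, VARIABLE, rng) for _ in range(5)]
for member in library:
    print(member)
```

Every member keeps the two cysteines in place, so the intervening randomized residues remain enclosed by the same potential disulfide.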

Once a micro-protein of desired binding characteristics
is characterized, it may be produced, not only by
recombinant DNA techniques, but also by nonbiological
synthetic methods.

For the purposes of the appended claims, a protein P is
a "binding protein" if for at least one molecular, ionic or
atomic species A, other than the variable domain of an
antibody, the dissociation constant K_D(P,A) < 10^-6
moles/liter (preferably, < 10^-7 moles/liter).

The exclusion of "variable domain of an antibody" in
the definition above is intended to make clear that for the purposes
herein a protein is not to be considered a "binding protein"
merely because it is antigenic.
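
The dissociation-constant criterion can be expressed as a small check. This is a sketch only: the function name, the data layout, and the example values are hypothetical, the threshold is passed in explicitly rather than fixed, and the caller is assumed to have already excluded antibody variable domains from the species tested.

```python
def meets_kd_criterion(kd_by_species, threshold):
    """True if at least one species is bound with K_D (moles/liter) below `threshold`.

    kd_by_species -- mapping of species name to measured K_D in moles/liter;
                     antibody variable domains are assumed to be left out.
    """
    return any(kd < threshold for kd in kd_by_species.values())


# Hypothetical measurements, tested against the preferred bound of 10^-7 M.
measured = {"target_A": 3e-8, "unrelated_B": 2e-3}
print(meets_kd_criterion(measured, threshold=1e-7))
```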

Most larger proteins fold into distinguishable globules
called domains (ROSS81). Protein domains have been defined
in various ways; definitions of "domain" which emphasize
stability -- retention of the overall structure in the face
of perturbing forces such as elevated temperatures or
chaotropic agents -- are favored, though atomic coordinates
and protein sequence homology are not completely ignored.

When a domain of a protein is primarily responsible for
the protein's ability to specifically bind a chosen target,
it is referred to herein as a "binding domain" (BD).

The term "variegated DNA" (vgDNA) refers to a mixture
of DNA molecules of the same or similar length which, when
aligned, vary at some codons so as to encode at each such
codon a plurality of different amino acids, but which encode
only a single amino acid at other codon positions. It is
further understood that in variegated DNA, the codons which
are variable, and the range and frequency of occurrence of
the different amino acids which a given variable codon
encodes, are determined in advance by the synthesizer of the
DNA, even though the synthetic method does not allow one to
know, a priori, the sequence of any individual DNA molecule
in the mixture. The number of designated variable codons in
the variegated DNA is preferably no more than 20 codons, and
more preferably no more than 5-10 codons. The mix of amino
acids encoded at each variable codon may differ from codon
to codon.
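
The limit on the number of variable codons is easier to appreciate by counting how fast the library grows. The following sketch assumes, purely for illustration, that every variable codon allows 32 distinct codons encoding all 20 amino acids; neither figure is prescribed by the definition above, and the function name is hypothetical.

```python
def library_sizes(n_variable_codons, codons_per_position=32, residues_per_position=20):
    """Return (distinct DNA sequences, distinct protein sequences) for a library."""
    dna = codons_per_position ** n_variable_codons
    protein = residues_per_position ** n_variable_codons
    return dna, protein


for n in (5, 10, 20):
    dna, protein = library_sizes(n)
    print(f"{n} variable codons: {dna:.2e} DNA sequences, {protein:.2e} protein sequences")
```

Under these assumptions five variable codons give about 3.4 x 10^7 DNA sequences, while twenty give more than 10^30, far beyond what any physical population of genetic packages could sample exhaustively.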

A population of genetic packages into which variegated
DNA has been introduced is likewise said to be "variegated".

For the purposes of this invention, the term "potential
binding protein" (PBP) refers to a protein encoded by one
species of DNA molecule in a population of variegated DNA
wherein the region of variation appears in one or more
subsequences encoding one or more segments of the
polypeptide having the potential of serving as a binding domain for
the target substance.

A "chimeric protein" is a fusion of a first amino acid
sequence (protein) with a second amino acid sequence
defining a domain foreign to and not substantially
homologous with any domain of the first protein. A chimeric
protein may present a foreign domain which is found (albeit
in a different protein) in an organism which also expresses
the first protein, or it may be an "interspecies",
"intergeneric", etc. fusion of protein structures expressed
by different kinds of organisms.

One amino acid sequence of the chimeric proteins of the
present invention is typically derived from an outer surface
protein of a "genetic package" (GP) as hereafter defined.
One which displays a PBD on its surface is a GP(PBD). The
second amino acid sequence is one which, if expressed alone,
would have the characteristics of a protein (or a domain
thereof) but is incorporated into the chimeric protein as a
recognizable domain thereof. It may appear at the amino or
carboxy terminal of the first amino acid sequence (with or
without an intervening spacer), or it may interrupt the
first amino acid sequence. The first amino acid sequence
may correspond exactly to a surface protein of the genetic
package, or it may be modified, e.g., to facilitate the
display of the binding domain.

II. MICRO- AND OTHER MINI-PROTEINS

In the present invention, disulfide bonded
micro-proteins and metal-containing mini-proteins are used both
as IPBDs in verifying a display strategy, and as PPBDs in
actually seeking to obtain a BD with the desired
target-binding characteristics. Unless otherwise stated or
required by context, references herein to IPBDs should be
taken to apply, mutatis mutandis, to PPBDs as well.

For the purpose of the appended claims, a micro-protein
has between about six and about forty residues;
micro-proteins are a subset of mini-proteins, which have fewer
than about sixty residues. Since micro-proteins form a subset
of mini-proteins, for convenience the term mini-proteins will
be used on occasion to refer to both disulfide-bonded
micro-proteins and metal-coordinated mini-proteins.

The IPBD may be a mini-protein with a known binding
activity, or one which, while not possessing a known binding
activity, possesses a secondary or higher structure that
lends itself to binding activity (clefts, grooves, etc.).
When the IPBD does have a known binding activity, it need
not have any specific affinity for the target material. The
IPBD need not be identical in sequence to a naturally
occurring mini-protein; it may be a "homologue" with an
amino acid sequence which "substantially corresponds" to
that of a known mini-protein, or it may be wholly
artificial.

In determining whether sequences should be deemed to
"substantially correspond", one should consider the
following issues: the degree of sequence similarity when the
sequences are aligned for best fit according to standard
algorithms, the similarity in the connectivity patterns of
any crosslinks (e.g., disulfide bonds), the degree to which
the proteins have similar three-dimensional structures, as
indicated by, e.g., X-ray diffraction analysis or NMR, and
the degree to which the sequenced proteins have similar
biological activity. In this context, it should be noted
that among the serine protease inhibitors, there are
families of proteins recognized to be homologous in which
there are pairs of members with as little as 30% sequence
homology.

A candidate IPBD should meet the following criteria: 

1) a domain exists that will remain stable under the
conditions of its intended use (the domain may
comprise the entire protein that will be inserted,
e.g., α-conotoxin GI (OLIV90a), or CMTI-III (MCWH89)),

2) knowledge of the amino acid sequence is obtainable,
and

3) a molecule is obtainable having specific and high
affinity for the IPBD, abbreviated AfM(IPBD).

If only one species of molecule having affinity for
IPBD (AfM(IPBD)) is available, it will be used to: a) detect
the IPBD on the GP surface, b) optimize expression level and
density of the affinity molecule on the matrix, and c)
determine the efficiency and sensitivity of the affinity
separation. One would prefer to have available two species
of AfM(IPBD), one with high and one with moderate affinity
for the IPBD. The species with high affinity would be used
in initial detection and in determining efficiency and
sensitivity, and the species with moderate affinity would be
used in optimization.

If the IPBD is not itself a known binding protein, or
if its native target has not been purified, an antibody
raised against the IPBD may be used as the affinity
molecule. Use of an antibody for this purpose should not be
taken to mean that the antibody is the ultimate target.

There are many candidate IPBDs for which all of the
above information is available or is reasonably practical to
obtain, for example, CMTI-III (29 residues) (CMTI-type
inhibitors are described in OTLE87, FAVE89, WIEC85, MCWH89,
BODE89, HOLA89a,b), heat-stable enterotoxin (ST-Ia of E.
coli) (18 residues) (GUAR89, BHAT86, SEKI85, SHIM87, TAKA85,
TAKE90, THOM85a,b, YOSH85, DALL90, DWAR89, GARI87, GUZM89,
GUZM90, HODG84, KUBO89, KUPE90, OKAM87, OKAM88, and OKAM90),
α-Conotoxin GI (13 residues) (HASH85, ALMQ89), μ-Conotoxin
GIII (22 residues) (HIDA90), and Conus King Kong
micro-protein (27 residues) (WOOD90). Structural information can
be obtained from X-ray or neutron diffraction studies, NMR,
chemical cross linking or labeling, modeling from known
structures of related proteins, or from theoretical
calculations. 3D structural information obtained by X-ray
diffraction, neutron diffraction or NMR is preferred because
these methods allow localization of almost all of the atoms
to within defined limits. Table 50 lists several preferred
IPBDs.

Mutations may reduce the stability of the PBD. Hence
the chosen IPBD should preferably have a high melting
temperature, e.g., at least 50°C, and preferably be stable
over a wide pH range, e.g., 8.0 to 3.0, but more preferably
11.0 to 2.0, so that the SBDs derived from the chosen IPBD
by mutation and selection-through-binding will retain
sufficient stability. Preferably, the substitutions in the
IPBD yielding the various PBDs do not reduce the melting
point of the domain below about 40°C. It will be appreciated
that mini-proteins that contain covalent crosslinks, such as
one or more disulfides, are therefore likely to be
sufficiently stable.

In vitro, disulfide bridges can form spontaneously in
polypeptides as a result of air oxidation. Matters are more
complicated in vivo. Very few intracellular proteins have
disulfide bridges, probably because a strong reducing
environment is maintained by the glutathione system.
Disulfide bridges are common in proteins that travel or
operate in extracellular spaces, such as snake venoms and
other toxins (e.g., conotoxins, charybdotoxin, bacterial
enterotoxins), peptide hormones, digestive enzymes,
complement proteins, immunoglobulins, lysozymes, protease
inhibitors (BPTI and its homologues, CMTI-III (Cucurbita
maxima trypsin inhibitor III) and its homologues, hirudin,
etc.) and milk proteins.

Disulfide bonds that close tight intrachain loops have
been found in pepsin, thioredoxin, insulin A-chain, silk
fibroin, and lipoamide dehydrogenase. The bridged cysteine
residues are separated by one to four residues along the
polypeptide chain. Model building, X-ray diffraction
analysis, and NMR studies have shown that the α carbon path
of such loops is usually flat and rigid.

There are two types of disulfide bridges in
immunoglobulins. One is the conserved intrachain bridge, spanning
about 60 to 70 amino acid residues and found, repeatedly, in
almost every immunoglobulin domain. Buried deep between the
opposing β sheets, these bridges are shielded from solvent
and ordinarily can be reduced only in the presence of
denaturing agents. The remaining disulfide bridges are
mainly interchain bonds and are located on the surface of
the molecule; they are accessible to solvent and relatively
easily reduced (STEI85). The disulfide bridges of the
micro-proteins of the present invention are intrachain
linkages between cysteines having much smaller chain
spacings.

When a micro-protein contains a plurality of disulfide
bonds, it is preferable that at least two cysteines be
clustered, i.e., be immediately adjacent along the chain
(-C-C-) or be separated by a single amino acid (-C-X-C-). In
either case, the two clustered cysteines become unable to
pair with each other for steric reasons, and the number of
realizable topologies is reduced.

An intrachain disulfide bridge connecting amino acids
3 and 8 of a 16 residue polypeptide will be said herein to
have a span of 4. If amino acids 4 and 12 are also
disulfide bonded, then they form a second span of 7.
Together, the four cysteines divide the polypeptide into
four intercysteine segments (1-2, 5-7, 9-11, and 13-16).
(Note that there is no segment between Cys3 and Cys4.) The
connectivity pattern of a crosslinked micro-protein is a
simple description of the relative location of the termini
of the crosslinks. For example, for a micro-protein with
two disulfide bonds, the connectivity pattern "1-3, 2-4"
means that the first crosslinked cysteine is disulfide
bonded to the third crosslinked cysteine (in the primary
sequence), and the second to the fourth.
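
The span and segment bookkeeping defined above can be checked with a
short sketch. The function names are illustrative only and do not come
from the specification; residue numbering is 1-based as in the text.

```python
def disulfide_span(i, j):
    """Number of residues lying strictly between bridged cysteines i and j."""
    lo, hi = sorted((i, j))
    return hi - lo - 1


def intercysteine_segments(length, cysteines):
    """Return (first, last) residue ranges not occupied by cysteines."""
    segments = []
    start = 1
    for c in sorted(cysteines):
        if c > start:                # at least one residue before this Cys
            segments.append((start, c - 1))
        start = c + 1
    if start <= length:              # residues after the last Cys
        segments.append((start, length))
    return segments


# The 16-residue example from the text: bridges 3-8 and 4-12.
bridges = [(3, 8), (4, 12)]
spans = [disulfide_span(i, j) for i, j in bridges]
cysteines = [pos for bridge in bridges for pos in bridge]
segments = intercysteine_segments(16, cysteines)
print(spans)
print(segments)
```

Adjacent cysteines (here Cys3 and Cys4) contribute no segment, which is
why four cysteines yield four segments rather than five.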

The degree to which the crosslink constrains the
conformational freedom of the mini-protein, and the degree
to which it stabilizes the mini-protein, may be assessed by
a number of means. These include absorption spectroscopy
(which can reveal whether an amino acid is buried or
exposed), circular dichroism studies (which provide a
general picture of the helical content of the protein),
nuclear magnetic resonance imaging (which reveals the number
of nuclei in a particular chemical environment as well as
the mobility of nuclei), and X-ray or neutron diffraction
analysis of protein crystals. The stability of the
mini-protein may be ascertained by monitoring the changes in
absorption at various wavelengths as a function of
temperature, pH, etc.; buried residues become exposed as the
protein unfolds. Similarly, the unfolding of the
mini-protein as a result of denaturing conditions results in
changes in NMR line positions and widths. Circular
dichroism (CD) spectra are extremely sensitive to
conformation.

The variegated disulfide-bonded micro-proteins of the
present invention fall into several classes.

Class I micro-proteins are those featuring a single
pair of cysteines capable of interacting to form a disulfide
bond, said bond having a span of no more than about nine
residues. This disulfide bridge preferably has a span of at
least two residues; this is a function of the geometry of
the disulfide bond. When the spacing is two or three
residues, one residue is preferably glycine in order to reduce
the strain on the bridged residues. The upper limit on
spacing is less precise; however, in general, the greater
the spacing, the less the constraint on conformation imposed
on the linearly intermediate amino acid residues by the
disulfide bond.

The main chain of such a peptide has very little
freedom, but is not stressed. The free energy released when
the disulfide forms exceeds the free energy lost by the
main chain when locked into a conformation that brings the
cysteines together. Having lost the free energy of
disulfide formation, the proximal ends of the side groups
are held in more or less fixed relation to each other. When
binding to a target, the domain does not need to expend free
energy getting into the correct conformation. The domain
cannot jump into some other conformation and bind a
non-target.

A disulfide bridge with a span of 4 or 5 is especially
preferred. If the span is increased to 6, the constraining
influence is reduced. In this case, we prefer that at least
one of the enclosed residues be an amino acid that imposes
restrictions on the main-chain geometry. Proline imposes
the most restriction. Valine and isoleucine restrict the
main chain to a lesser extent. The preferred position for
this constraining non-cysteine residue is adjacent to one of
the invariant cysteines; however, it may be one of the other
bridged residues. If the span is seven, we prefer to
include two amino acids that limit main-chain conformation.
These amino acids could be at any of the seven positions,
but are preferably the two bridged residues that are
immediately adjacent to the cysteines. If the span is eight
or nine, additional constraining amino acids may be
provided.

While a class I micro-protein may have up to 40 amino 
acids, more preferably it is no more than 20 amino acids. 

The disulfide bond of a class I micro-protein is
exposed to solvent. Thus, one usually should avoid exposing 
the variegated population of GPs that display class I micro- 
proteins to reagents that rupture disulfides. 

Class II micro-proteins are those featuring a single
disulfide bond having a span of greater than nine amino 
acids. The bridged amino acids form secondary structures 
which help to stabilize their conformation. Preferably, 
these intermediate amino acids form hairpin supersecondary 
structures such as those schematized below: 

  +-------- S--S ---------+
  |                       |
-Cys-αhelix-turn-βstrand-Cys-

  +-------- S--S --------+
  |                      |
-Cys-αhelix-turn-αhelix-Cys-

  +--------- S--S ---------+
  |                        |
-Cys-βstrand-turn-βstrand-Cys-

Based on studies of known proteins, one may calculate
the propensity of a particular residue, or of a particular
dipeptide or tripeptide, to be found in an α helix, β strand
or reverse turn. The normalized frequencies of occurrence
of the amino acid residues in these secondary structures are
given in Table 6-4 of CREI84. For a more detailed treatment
of the prediction of secondary structure from the amino acid
sequence, see Chapter 6 of SCHU79.

In designing a suitable hairpin structure, one may copy
an actual structure from a protein whose three-dimensional
conformation is known, design the structure using frequency
data, or combine the two approaches. Preferably, one or
more actual structures are used as a model, and the
frequency data are used to determine which mutations can be
made without disrupting the structure.

Preferably, no more than three amino acids lie between
the cysteine and the beginning or end of the α helix or β
strand.

More complex structures (such as a double hairpin) are
also possible.

Class IIIa micro-proteins are those featuring two
disulfide bonds. They optionally may also feature secondary
structures such as those discussed above with regard to
Class II micro-proteins. With two disulfide bonds, there
are three possible topologies; if desired, the number of
realizable disulfide bonding topologies may be reduced by
clustering cysteines as in heat-stable enterotoxin ST-Ia.

Class IIIb micro-proteins are those featuring three or
more disulfide bonds and preferably at least one cluster of
cysteines as previously described.
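
The count of three topologies for two disulfide bonds, and the effect
of clustering, can be reproduced by enumerating all pairings of the
cysteine positions. The following is a minimal sketch; the function
names and the parameter max_cluster_offset are hypothetical, and the
rule that clustered cysteines (-C-C- or -C-X-C-) cannot pair with each
other is taken from the text above.

```python
def pairings(positions):
    """Yield every way of pairing all positions into disulfide bonds."""
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def realizable(positions, max_cluster_offset=2):
    """Pairings in which no bond joins two clustered cysteines.

    Two cysteines count as clustered when their positions differ by
    at most max_cluster_offset (1 for -C-C-, 2 for -C-X-C-).
    """
    ordered = sorted(positions)
    return [p for p in pairings(ordered)
            if all(j - i > max_cluster_offset for i, j in p)]


all_two_bond = list(pairings([3, 8, 14, 20]))     # four well-separated Cys
clustered_two_bond = realizable([3, 4, 8, 12])    # Cys3 and Cys4 adjacent
all_three_bond = list(pairings([1, 5, 9, 13, 17, 21]))
print(len(all_two_bond), len(clustered_two_bond), len(all_three_bond))
```

With one adjacent pair among four cysteines, the pairing that would
join the clustered residues drops out, leaving two of the three
topologies.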

Metal Finger Mini-Proteins. The present invention also
relates to mini-proteins which are not crosslinked by
disulfide bonds, e.g., analogues of finger proteins. Finger
proteins are characterized by finger structures in which a
metal ion is coordinated by two Cys and two His residues,
forming a tetrahedral arrangement around it. The metal ion
is most often zinc(II), but may be iron, copper, cobalt,
etc. The "finger" has the consensus sequence (Phe or Tyr)-
(1 AA)-Cys-(2-4 AAs)-Cys-(3 AAs)-Phe-(5 AAs)-Leu-(2 AAs)-
His-(3 AAs)-His-(5 AAs) (BERG88; GIBS88). While finger
proteins typically contain many repeats of the finger motif,
it is known that a single finger will fold in the presence
of zinc ions (FRAN87; PARR88). There is some dispute as to
whether two fingers are necessary for binding to DNA. The
present invention encompasses mini-proteins with either one
or two fingers. Other combinations of side groups can lead
to formation of crosslinks involving multivalent metal ions.
Summers (SUMM91), for example, reports an 18-amino-acid
mini-protein found in the capsid protein of HIV-1-F1 and having
three cysteines and one histidine that bind a zinc atom. It
is to be understood that the target need not be a nucleic
acid.
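
The finger consensus quoted above translates directly into a regular
expression over one-letter amino acid codes. This is only a sketch:
the pattern name is hypothetical, and the test sequence is a made-up
string built to satisfy the consensus, not a sequence from any cited
protein.

```python
import re

# (Phe or Tyr)-(1 AA)-Cys-(2-4 AAs)-Cys-(3 AAs)-Phe-(5 AAs)-Leu-(2 AAs)-
# His-(3 AAs)-His-(5 AAs), with [A-Z] standing for "any amino acid".
FINGER_CONSENSUS = re.compile(
    r"[FY][A-Z]C[A-Z]{2,4}C[A-Z]{3}F[A-Z]{5}L[A-Z]{2}H[A-Z]{3}H[A-Z]{5}"
)

# Invented sequence for illustration only.
candidate = "YACPECGKSFSRSDELTRHIRTHTGEKP"
# Same sequence with the first coordinating Cys replaced by Ser.
broken = "YASPECGKSFSRSDELTRHIRTHTGEKP"

matches_consensus = FINGER_CONSENSUS.fullmatch(candidate) is not None
broken_matches = FINGER_CONSENSUS.fullmatch(broken) is not None
print(matches_consensus, broken_matches)
```

In a variegation plan of the kind described later for finger proteins,
the positions written as literal letters in the pattern are the ones
held invariant.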

G. Modified PBDs

There exist a number of enzymes and chemical reagents
that can selectively modify certain side groups of proteins,
including: protein-tyrosine kinase, Ellman's reagent,
methyl transferases (that methylate GLU side groups), serine
kinases, proline hydroxylases, vitamin-K dependent enzymes
that convert GLU to GLA, maleic anhydride, and alkylating
agents. Treatment of the variegated population of GP(PBD)s
with one of these enzymes or reagents will modify the side
groups affected by the chosen enzyme or reagent. Enzymes
and reagents that do not kill the GP are much preferred.
Such modification of side groups can directly affect the
binding properties of the displayed PBDs. Using affinity
separation methods, we enrich for the modified GPs that bind
the predetermined target. Since the active binding domain
is not entirely genetically specified, we must repeat the
post-morphogenesis modification at each enrichment round.
This approach is particularly appropriate with mini-protein
IPBDs because we envision chemical synthesis of these SBDs.
III. VARIEGATION STRATEGY: MUTAGENESIS TO OBTAIN POTENTIAL
BINDING DOMAINS WITH DESIRED DIVERSITY

III.A. Generally

When the number of different amino acid sequences
obtainable by mutation of the domain is large when compared
to the number of different domains which are displayable in
detectable amounts, the efficiency of the forced evolution
is greatly enhanced by careful choice of which residues are
to be varied. First, residues of a known protein which are
likely to affect its binding activity (e.g., surface
residues) and not likely to unduly degrade its stability are
identified. Then all or some of the codons encoding these
residues are varied simultaneously to produce a variegated
population of DNA. Groups of surface residues that are
close enough together on the surface to touch one molecule
of target simultaneously are preferred sets for simultaneous
variegation. The variegated population of DNA is used to
express a variety of potential binding domains, whose
ability to bind the target of interest may then be
evaluated.

The method of the present invention is thus further
distinguished from other methods in the nature of the highly
variegated population that is produced and from which novel
binding proteins are selected. We force the displayed
potential binding domain to sample the nearby "sequence
space" of related amino-acid sequences in an efficient,
organized manner. Four goals guide the various variegation
plans used herein, preferably: 1) a very large number (e.g.,
10^7) of variants is available, 2) a very high percentage of
the possible variants actually appears in detectable
amounts, 3) the frequency of appearance of the desired
variants is relatively uniform, and 4) variation occurs only
at a limited number of amino-acid residues, most preferably
at residues having side groups directed toward a common
region on the surface of the potential binding domain.

This is to be distinguished from the simple use of
indiscriminate mutagenic agents such as radiation and
hydroxylamine to modify a gene, where there is no (or very
oblique) control over the site of mutation. Many of the
mutations will affect residues that are not a part of the
binding domain. When chemical mutagens are directed toward
the whole genome, most mutations occur in genes other than
the one encoding the potential binding domain. Moreover,
since at a reasonable level of mutagenesis, any modified
codon is likely to be characterized by a single base change,
only a limited and biased range of possibilities will be
explored. Equally remote is the use of site-specific
mutagenesis techniques employing mutagenic oligonucleotides
of nonrandomized sequence, since these techniques do not
lend themselves to the production and testing of a large
number of variants. While focused random mutagenesis
techniques are known, the importance of controlling the
distribution of variation has been largely overlooked.

The potential binding domains are first designed at the
amino acid level. Once we have identified which residues
are to be mutagenized, and which mutations to allow at those
positions, we may then design the variegated DNA which is to
encode the various PBDs so as to assure that there is a
reasonable probability that if a PBD has an affinity for the
target, it will be detected. Of course, the number of
independent transformants obtained and the sensitivity of
the affinity separation technology will impose limits on the
extent of variegation possible within any single round of
variegation.

There are many ways to generate diversity in a protein.
(See RICH86, CARU85, and OLIP86.) At one extreme, we vary
a few residues of the protein as much as possible (inter
alia, see CARU85, CARU87, RICH86, and WHAR86). We will call
this approach "Focused Mutagenesis". A typical "Focused
Mutagenesis" strategy is to pick a set of five to seven
residues and vary each through 13-20 possibilities. An
alternative plan of mutagenesis ("Diffuse Mutagenesis") is
to vary many more residues through a more limited set of
choices (see VERS86a and PAKU86). The variegation pattern
adopted may fall between these extremes, e.g., two residues
varied through all twenty amino acids, two more through only
two possibilities, and a fifth into ten of the twenty amino
acids.
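
To make the sizes of these variegation plans concrete, the number of
distinct amino acid sequences in a plan is the product of the number of
choices at each varied residue. The sketch below assumes nothing beyond
that product rule; the function name is illustrative.

```python
from math import prod


def variegation_level(choices_per_residue):
    """Number of distinct amino acid sequences a variegation plan allows."""
    return prod(choices_per_residue)


# Focused mutagenesis at its two quoted extremes.
focused_small = variegation_level([13] * 5)    # five residues, 13 choices each
focused_large = variegation_level([20] * 7)    # seven residues, 20 choices each

# The intermediate pattern: 20, 20, 2, 2 and 10 choices at five residues.
intermediate = variegation_level([20, 20, 2, 2, 10])

print(focused_small, focused_large, intermediate)
```

The focused extremes span roughly 3.7 x 10^5 to 1.3 x 10^9 sequences,
while the intermediate pattern yields 16,000.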

There is no fixed limit on the number of codons which
can be mutated simultaneously. However, it is desirable to
adopt a mutagenesis strategy which results in a reasonable
probability that a possible PBD sequence is in fact
displayed by at least one genetic package. Preferably, the
probability that a mutein encoded by the vgDNA and composed
of the least favored amino acids at each variegated position
will be displayed by at least one independent transformant
in the library is at least 0.50, and more preferably at
least 0.90. (Muteins composed of more favored amino acids
would of course be more likely to occur in the same
library.)
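
The 0.50 and 0.90 thresholds can be related to library size with a
simple sampling model. The sketch below assumes, as a simplification
not stated in the text, that each independent transformant carries one
DNA sequence drawn at random from the variegated population, so that a
mutein encoded by a fraction f of the DNA is displayed at least once
among N transformants with probability 1 - (1 - f)^N. The function
names and the five-codon SNT example are illustrative.

```python
import math


def display_probability(f, n):
    """P(at least one of n independent transformants encodes the mutein)."""
    return -math.expm1(n * math.log1p(-f))


def transformants_needed(f, p):
    """Smallest n for which display_probability(f, n) is at least p."""
    return math.ceil(math.log1p(-p) / math.log1p(-f))


# Example: five codons, each encoding 8 amino acids once (as SNT does),
# so every mutein is encoded by (1/8)**5 of the DNA sequences.
f_least_favored = (1 / 8) ** 5
n_for_half = transformants_needed(f_least_favored, 0.50)
n_for_ninety = transformants_needed(f_least_favored, 0.90)
print(n_for_half, n_for_ninety)
```

Under this model the required number of transformants scales as
roughly 0.69/f for a probability of 0.50 and 2.3/f for 0.90.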

Preferably, the variegation is such as will cause a 
typical transformant population to display 10*-10* different 
amino acid sequences by means of preferably not more than 
10-fold more (more preferably not more than 3-fold) 
different DNA sequences.

For a Class I micro-protein that lacks α helices and β
strands, one will, in any given round of mutation,
preferably variegate each of 4-8 non-cysteine codons so that
they each encode at least eight of the 20 possible amino
acids. The variegation at each codon could be customized to
that position. Preferably, cysteine is not one of the
potential substitutions, though it is not excluded.

When the mini-protein is a metal finger protein, in a 
typical variegation strategy, the two Cys and two His 
residues, and optionally also the aforementioned Phe/Tyr, 
Phe and Leu residues, are held invariant and a plurality 
(usually 5-10) of the other residues are varied. 

When the micro-protein is of the type featuring one or
more α helices and β strands, the set of potential amino
acid modifications at any given position is picked to favor
those which are less likely to disrupt the secondary
structure at that position. Since the number of
possibilities at each variable amino acid is more limited, the total
number of variable amino acids may be greater without
altering the sampling efficiency of the selection process.

For class III micro-proteins, preferably not more than
20 and more preferably 5-10 codons will be variegated.
However, if diffuse mutagenesis is employed, the number of
codons which are variegated can be higher.

While variegation normally will involve the
substitution of one amino acid for another at a designated variable
codon, it may involve the insertion or deletion of amino
acids as well.

III.B. Identification of Residues to be Varied

We now consider the principles that guide our choice of
residues of the IPBD to vary. A key concept is that only
structured proteins exhibit specific binding, i.e., can bind
to a particular chemical entity to the exclusion of most
others. Thus the residues to be varied are chosen with an
eye to preserving the underlying IPBD structure.
Substitutions that prevent the PBD from folding will cause
GPs carrying those genes to bind indiscriminately so that
they can easily be removed from the population.
Substitutions of amino acids that are exposed to solvent are
less likely to affect the 3D structure than are
substitutions at internal loci. (See PAKU86, REID88a,
EISE85, SCHU79, p169-171 and CREI84, p239-245, 314-315.)
Internal residues are frequently conserved and the amino
acid type cannot be changed to a significantly different
type without substantial risk that the protein structure
will be disrupted. Nevertheless, some conservative changes
of internal residues, such as I to L or F to Y, are
tolerated. Such conservative changes subtly affect the
placement and dynamics of adjacent protein residues and such
"fine tuning" may be useful once an SBD is found.
Insertions and deletions are more readily tolerated in loops than
elsewhere (THOR88).
Data about the IPBD and the target that are useful in
deciding which residues to vary in the variegation cycle
include: 1) 3D structure, or at least a list of residues on
the surface of the IPBD, 2) list of sequences homologous to
IPBD, and 3) model of the target molecule or a stand-in for
the target.

III.C. Determining the Substitution Set for Each Parental
Residue

Having picked which residues to vary, we now decide the
range of amino acids to allow at each variable residue. The
total level of variegation is the product of the number of
variants at each varied residue. Each varied residue can
have a different scheme of variegation, producing 2 to 20
different possibilities. The set of amino acids which are
potentially encoded by a given variegated codon is called
its "substitution set".

The computer that controls a DNA synthesizer, such as
the Milligen 7500, can be programmed to synthesize any base
of an oligo-nt with any distribution of nts by taking some
nt substrates (e.g., nt phosphoramidites) from each of two or
more reservoirs. Alternatively, nt substrates can be mixed
in any ratios and placed in one of the extra reservoirs for
so called "dirty bottle" synthesis. Each codon could be
programmed differently. The "mix" of bases at each
nucleotide position of the codon determines the relative
frequency of occurrence of the different amino acids encoded
by that codon.

Simply variegated codons are those in which those
nucleotide positions which are degenerate are obtained from
a mixture of two or more bases mixed in equimolar
proportions. These mixtures are described in this specification
by means of the standardized "ambiguous nucleotide" code.
In this code, for example, in the degenerate codon "SNT",
"S" denotes an equimolar mixture of bases G and C, "N", an
equimolar mixture of all four bases, and "T", the single
invariant base thymidine.
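
The substitution set of a simply variegated codon can be worked out
mechanically by expanding each ambiguity letter into its bases and
translating every resulting codon. The sketch below uses the standard
IUPAC ambiguity letters and the standard genetic code; the function
name substitution_set is illustrative.

```python
from itertools import product

# IUPAC ambiguity letters mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def substitution_set(degenerate_codon):
    """Map each encoded amino acid ('*' = stop) to its number of codons."""
    counts = {}
    for bases in product(*(IUPAC[letter] for letter in degenerate_codon)):
        amino_acid = CODON_TABLE["".join(bases)]
        counts[amino_acid] = counts.get(amino_acid, 0) + 1
    return counts


snt = substitution_set("SNT")
print(sorted(snt))          # amino acids encoded by SNT
print(set(snt.values()))    # how many codons encode each one
```

Because the mixtures are equimolar, the codon counts returned here are
directly proportional to the expected amino acid frequencies.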

Complexly variegated codons are those in which at least
one of the three positions is filled by a base from an other
than equimolar mixture of two or more bases.

Either simply or complexly variegated codons may be 
used to achieve the desired substitution set. 

If we have no information indicating that a particular
amino acid or class of amino acid is appropriate, we strive
to substitute all amino acids with equal probability because
representation of one mini-protein above the detectable
level is wasteful. Equal amounts of all four nts at each
position in a codon (NNN) yield the amino acid distribution
in which each amino acid is present in proportion to the
number of codons that code for it. This distribution has
the disadvantage of giving two basic residues for every
acidic residue. In addition, six times as much R, S, and L
as W or M occur. If five codons are synthesized with this
distribution, each of the 243 sequences encoding some
combination of L, R, and S is 7776-times more abundant than
each of the 32 sequences encoding some combination of W and
M. To have five Ws present at detectable levels, we must
have each of the (L,R,S) sequences present in 7776-fold
excess.
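
The figures quoted for the NNN distribution follow from counting codons
in the standard genetic code, as the following sketch shows. Variable
names are illustrative; lysine and arginine are counted as the basic
residues and aspartate and glutamate as the acidic ones.

```python
from collections import Counter

# Standard genetic code as a 64-letter string (codon order TTT, TTC,
# TTA, TTG, TCT, ...); '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons_per_aa = Counter(AMINO)

basic = codons_per_aa["K"] + codons_per_aa["R"]      # LYS + ARG codons
acidic = codons_per_aa["D"] + codons_per_aa["E"]     # ASP + GLU codons

n_codons = 5
lrs_sequences = 3 ** n_codons        # sequences built only from L, R, S
wm_sequences = 2 ** n_codons         # sequences built only from W, M
excess = (codons_per_aa["L"] // codons_per_aa["W"]) ** n_codons

print(basic, acidic, lrs_sequences, wm_sequences, excess)
```

Each of L, R and S has six codons while W and M have one, so over five
NNN codons the abundance ratio is 6^5 = 7776.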

Particular amino acid residues can influence the
tertiary structure of a defined polypeptide in several ways,
including by:

a) affecting the flexibility of the polypeptide main
chain,

b) adding hydrophobic groups,

c) adding charged groups,

d) allowing hydrogen bonds, and

e) forming cross-links, such as disulfides, chelation to
metal ions, or bonding to prosthetic groups.




Lundeen (LUND86) has tabulated the frequencies of amino
acids in helices, β strands, turns, and coil in proteins of
known 3D structure and has distinguished between CYSs having
free thiol groups and half cystines. He reports that free
CYS is found most often in helices while half cystines are
found more often in β sheets. Half cystines are, however,
regularly found in helices. Pease et al. (PEAS90)
constructed a peptide having two cystines; one end of each
is in a very stable α helix. Apamin has a similar structure
(WEMM83, PEAS88).
Flexibility:

GLY is the smallest amino acid, having two hydrogens
attached to the Cα. Because GLY has no Cβ, it confers the
most flexibility on the main chain. Thus GLY occurs very
frequently in reverse turns, particularly in conjunction
with PRO, ASP, ASN, SER, and THR.

The amino acids ALA, SER, CYS, ASP, ASN, LEU, MET, PHE,
TYR, TRP, ARG, HIS, GLU, GLN, and LYS have unbranched β
carbons. Of these, the side groups of SER, ASP, and ASN
frequently make hydrogen bonds to the main chain and so can
take on main-chain conformations that are energetically
unfavorable for the others. VAL, ILE, and THR have branched
β carbons, which makes the extended main-chain conformation
more favorable. Thus VAL and ILE are most often seen in β
sheets. Because the side group of THR can easily form
hydrogen bonds to the main chain, it has less tendency to
exist in a β sheet.

The main chain of proline is particularly constrained
by the cyclic side group. The φ angle is always close to
-60°. Most prolines are found near the surface of the
protein.

LYS and ARG carry a single positive charge at any pH
below 10.4 or 12.0, respectively. Nevertheless, the
methylene groups, four and three respectively, of these
amino acids are capable of hydrophobic interactions. The
guanidinium group of ARG is capable of donating five
hydrogens simultaneously, while the amino group of LYS can
donate only three. Furthermore, the geometries of these
groups are quite different, so that these groups are often
not interchangeable.

ASP and GLU carry a single negative charge at any pH
above ~4.5 and 4.6, respectively. Because ASP has but one
methylene group, few hydrophobic interactions are possible.
The geometry of ASP lends itself to forming hydrogen bonds
to main-chain nitrogens, which is consistent with ASP being
found very often in reverse turns and at the beginning of
helices. GLU is more often found in α helices and
particularly in the amino-terminal portion of these helices
because the negative charge of the side group has a
stabilizing interaction with the helix dipole (NICH88,
SALI88).

HIS has an ionization pK in the physiological range,
viz. 6.2. This pK can be altered by the proximity of
charged groups or of hydrogen donators or acceptors. HIS is
capable of forming bonds to metal ions such as zinc, copper,
and iron.
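
The pK values quoted above can be turned into approximate charge states
with the textbook Henderson-Hasselbalch relation. This relation is not
part of the source text, and the sketch treats each side group as an
isolated titratable site, ignoring the shifts caused by neighbouring
groups that the text mentions for HIS. The function name is
illustrative.

```python
def fraction_protonated(pH, pK):
    """Henderson-Hasselbalch: fraction of a group in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pK))


# HIS (pK 6.2) is half protonated at pH 6.2 and mostly neutral at pH 7.2.
his_at_pk = fraction_protonated(6.2, 6.2)
his_at_7_2 = fraction_protonated(7.2, 6.2)

# LYS (pK 10.4) stays almost fully protonated, i.e. positive, at pH 7.
lys_at_7 = fraction_protonated(7.0, 10.4)

# ASP (pK about 4.5) is almost fully deprotonated, i.e. negative, at pH 7.
asp_charged_at_7 = 1.0 - fraction_protonated(7.0, 4.5)

print(his_at_pk, round(his_at_7_2, 3), round(lys_at_7, 4),
      round(asp_charged_at_7, 4))
```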
Hydrogen bonds:

Aside from the charged amino acids, SER, THR, ASN, GLN,
TYR, and TRP can participate in hydrogen bonds.
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The most important form of cross link is the disulfide 
bond formed between the thiols of CYS residues. In a 
suitably oxidizing environment, these bonds form 
spontaneously. These bonds can greatly ^ iliz * a 
particular conformation of a protein or mini-protein. When 
a mixture of oxidized and reduced thiol reagents are 
present, exchange reactions take place that allow the most 
stable conformation to predominate. Concerning disulfides 



in proteins and peptides, see also KATZ90, MATS89, PERR84,
PERR86, SAUE86, WELL86, JANA89, HORV89, KISH85, and SCHN86.

Other cross links that form without need of specific
enzymes include:

1) (CYS)4:Fe              Rubredoxin (in CREI84, p. 376)
2) (CYS)4:Zn              Aspartate transcarbamylase (in CREI84,
                          p. 376) and Zn-fingers (HARD90)
3) (HIS)2(MET)(CYS):Cu    Azurin (in CREI84, p. 376) and
                          Basic "Blue" Cu Cucumber protein (GUSS88)
4) (HIS)4:Cu              CuZn superoxide dismutase
5) (CYS)4:(Fe4S4)         Ferredoxin (in CREI84, p. 376)
6) (CYS)2(HIS)2:Zn        Zinc-fingers (GIBS88, SUMM91)
7) (CYS)3(HIS):Zn         Zinc-fingers (GAUS87, GIBS88)

Cross links having (HIS)2(MET)(CYS):Cu have the potential
advantage that HIS and MET cannot form other cross links
without Cu.

Simply Variegated Codons 

The following simply variegated codons are useful
because they encode a relatively balanced set of amino
acids:

1) SNT, which encodes the set [L,P,H,R,V,A,D,G]: a) one
   acidic (D) and one basic (R), b) both aliphatic (L,V)
   and aromatic hydrophobics (H), c) large (L,R,H) and
   small (G,A) side groups, d) rigid (P) and flexible (G)
   amino acids, e) each amino acid encoded once.

2) RNG, which encodes the set [M,T,K,R,V,A,E,G]: a) one
   acidic and two basic (not optimal, but acceptable), b)
   hydrophilics and hydrophobics, c) each amino acid
   encoded once.

3) RMG, which encodes the set [T,K,A,E]: a) one acidic, one
   basic, one neutral hydrophilic, b) three favor α
   helices, c) each amino acid encoded once.

4) VNT, which encodes the set [L,P,H,R,I,T,N,S,V,A,D,G]:
   a) one acidic, one basic, b) all classes: charged,
   neutral hydrophilic, hydrophobic, rigid and flexible,
   etc., c) each amino acid encoded once.

5) RRS, which encodes the set [N,S,K,R,D,E,G²]: a) two
   acidics, two basics, b) two neutral hydrophilics, c)
   only glycine encoded twice.

6) NNT, which encodes the set
   [F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G]: a) sixteen DNA sequences
   provide fifteen different amino acids; only serine is
   repeated, all others are present in equal amounts (this
   allows very efficient sampling of the library), b) there
   are equal numbers of acidic and basic amino acids (D and
   R, once each), c) all major classes of amino acids are
   present: acidic, basic, aliphatic hydrophobic, aromatic
   hydrophobic, and neutral hydrophilic.

7) NNG, which encodes the set
   [L²,R²,S,W,P,Q,M,T,K,V,A,E,G,stop]: a) fair preponderance
   of residues that favor formation of α-helices
   [L,M,A,Q,K,E; and, to a lesser extent, S,R,T]; b) encodes
   13 different amino acids. (VHG encodes a subset of the
   set encoded by NNG: 9 amino acids in nine different DNA
   sequences, with equal acids and bases, and 5/9 being α
   helix-favoring.)
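
The sets listed above can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and the IUPAC ambiguity letters (R, S, M, V, H, N, and so on); the function name `encoded_counts` is illustrative and not part of the original disclosure.

```python
from collections import Counter
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# IUPAC ambiguity codes used to write variegated codons.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def encoded_counts(vg_codon):
    """Count how many DNA codons of a variegated codon encode each residue.

    '*' stands for a stop codon.
    """
    choices = [IUPAC[ch] for ch in vg_codon]
    return Counter(CODON_TABLE["".join(c)] for c in product(*choices))


for name in ("SNT", "RNG", "RMG", "VNT", "RRS", "NNT", "NNG", "VHG"):
    counts = encoded_counts(name)
    print(name, sum(counts.values()), "codons:", dict(sorted(counts.items())))
```

A count greater than one for a residue corresponds to the superscripted entries in the lists (for example glycine in RRS, or leucine and arginine in NNG).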

For the initial variegation, NNT is preferred, in most
cases. However, when the codon is encoding an amino acid to
be incorporated into an α helix, NNG is preferred.

Below, we analyze several simple variegations as to the 
efficiency with which the libraries can be sampled. 

Libraries of random hexapeptides encoded by (NNK)6 have
been reported (SCOT90, CWIR90). Table 130 shows the
expected behavior of such libraries. NNK produces single
codons for PHE, TYR, CYS, TRP, HIS, GLN, ILE, MET, ASN, LYS,
ASP, and GLU (α set); two codons for each of VAL, ALA, PRO,
THR, and GLY (Φ set); and three codons for each of LEU, ARG,
and SER (Ω set). We have separated the 64,000,000 possible
sequences into 28 classes, shown in Table 130A, based on the
number of amino acids from each of these sets. The largest
class is ΦΩαααα with ~14.6% of the possible sequences.
Aside from any selection, all the sequences in one class
have the same probability of being produced. Table 130B
shows the probability that a given DNA sequence taken from
the (NNK)6 library will encode a hexapeptide belonging to
one of the defined classes; note that only ~6.3% of DNA
sequences belong to the ΦΩαααα class.
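
The class bookkeeping described above can be sketched as follows. This is an illustrative reconstruction, not the applicants' program: it assumes the three sets are defined by codon multiplicity under NNK (12 residues with one codon, 5 with two, 3 with three) and that the single TAG stop is discounted, leaving 31 usable codons per position. The names `alpha`, `Phi`, and `Omega` are labels chosen here for those three sets.

```python
from math import factorial

N_POS = 6          # hexapeptide
DNA_PER_POS = 31   # 32 NNK codons minus the TAG stop (assumption: stops vanish)


def classes():
    """Yield (n_alpha, n_phi, n_omega, aa_fraction, dna_fraction) per class.

    alpha: 12 residues with one codon; Phi: 5 residues with two codons;
    Omega: 3 residues with three codons.
    """
    for a in range(N_POS + 1):
        for f in range(N_POS + 1 - a):
            o = N_POS - a - f
            arrangements = factorial(N_POS) // (
                factorial(a) * factorial(f) * factorial(o))
            n_aa = arrangements * 12 ** a * 5 ** f * 3 ** o
            n_dna = n_aa * 2 ** f * 3 ** o   # codons per residue: 1, 2, 3
            yield a, f, o, n_aa / 20 ** N_POS, n_dna / DNA_PER_POS ** N_POS


rows = list(classes())
largest = max(rows, key=lambda r: r[3])
print(len(rows), "classes; largest (alpha, Phi, Omega) =", largest[:3])
print(f"share of amino-acid sequences: {largest[3]:.1%}")
print(f"share of DNA sequences:        {largest[4]:.1%}")
```

Under these assumptions the largest class by amino-acid count is the one with four single-codon residues, one two-codon residue, and one three-codon residue, which is the class singled out in the text.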

Table 130C shows the expected numbers of sequences in
each class for libraries containing various numbers of
independent transformants (viz. 10^6, 3·10^6, 10^7, 3·10^7,
10^8, 3·10^8, 10^9, and 3·10^9). At 10^6 independent
transformants (ITs), we expect to see 56% of the ΩΩΩΩΩΩ
class, but only 0.1% of the αααααα class. The vast majority
of sequences seen come from classes for which less than 10%
of the class is sampled. Suppose a peptide from, for
example, class ΦΦΩΩαα is isolated by fractionating the
library for binding to a target. Consider how much we know
about peptides that are related to the isolated sequence.
Because only 4% of the ΦΦΩΩαα class was sampled, we cannot
conclude that the amino acids from the Ω set are in fact the
best from the Ω set. We might have LEU at position 2, but
ARG or SER could be better. Even if we isolate a peptide of
the ΩΩΩΩΩΩ class, there is a noticeable chance that better
members of the class were not present in the library.

With a library of 10^7 ITs, we see that several classes
have been completely sampled, but that the αααααα class is
only 1.1% sampled. At 7.6·10^7 ITs, we expect display of 50%
of all amino-acid sequences, but the classes containing
three or more amino acids of the α set are still poorly
sampled. To achieve complete sampling of the (NNK)6 library
requires about 3·10^9 ITs, 10-fold larger than the largest
(NNK)6 library so far reported.
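
The sampling percentages quoted for the (NNK)6 library are consistent with a simple Poisson model, sketched below. The model is an assumption made here for illustration: each independent transformant is taken to carry one of the 31^6 stop-free DNA sequences with equal probability, so a peptide encoded by m DNA sequences is present with probability 1 - exp(-N·m/31^6).

```python
from math import exp, factorial

TOTAL_DNA = 31 ** 6   # stop-free (NNK)6 DNA sequences (assumption)


def sampled_fraction(n_its, n_phi, n_omega):
    """Expected fraction of a class that is present among n_its transformants.

    n_phi / n_omega: number of positions held by two-codon / three-codon
    residues; the remaining positions hold single-codon residues.
    """
    codes_per_peptide = 2 ** n_phi * 3 ** n_omega
    return 1.0 - exp(-n_its * codes_per_peptide / TOTAL_DNA)


def overall_fraction(n_its):
    """Expected fraction of all 20^6 hexapeptides present in the library."""
    total = 0.0
    for a in range(7):
        for f in range(7 - a):
            o = 6 - a - f
            arrangements = factorial(6) // (
                factorial(a) * factorial(f) * factorial(o))
            weight = arrangements * 12 ** a * 5 ** f * 3 ** o / 20 ** 6
            total += weight * sampled_fraction(n_its, f, o)
    return total


for n in (1e6, 1e7, 7.6e7, 3e9):
    print(f"{n:.1e} ITs: all-three-codon class {sampled_fraction(n, 0, 6):.3f}, "
          f"all-one-codon class {sampled_fraction(n, 0, 0):.4f}, "
          f"overall {overall_fraction(n):.3f}")
```

The same two functions can be reused for other library sizes when deciding how many transformants are needed before a class can be regarded as well sampled.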

Table 131 shows expectations for a library encoded by
(NNT)4(NNG)2. The expectations of abundance are independent
of the order of the codons or of interspersed unvaried
codons. This library encodes 0.133 times as many amino-acid
sequences, but there are only 0.0165 times as many DNA
sequences. Thus 5.0·10^7 ITs (i.e., 60-fold fewer than
required for (NNK)6) gives almost complete sampling of the
library. The results would be slightly better for (NNT)6
and slightly, but not much, worse for (NNG)6. The
controlling factor is the ratio of DNA sequences to amino-
acid sequences.

Table 132 shows the ratio of #DNA sequences/#AA
sequences for codons NNK, NNT, and NNG. For NNK and NNG, we
have assumed that the PBD is displayed as part of an
essential gene, such as gene III in Ff phage, as is
indicated by the phrase "assuming stops vanish". It is not
in any way required that such an essential gene be used. If
a non-essential gene is used, the analysis would be slightly
different; sampling of NNK and NNG would be slightly less
efficient. Note that (NNT)6 gives 3.6-fold more amino-acid
sequences than (NNK)5 but requires 1.7-fold fewer DNA
sequences. Note also that (NNT)7 gives twice as many amino-
acid sequences as (NNK)6, but 3.3-fold fewer DNA sequences.
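
The library-size comparisons above reduce to products of per-codon counts. The sketch below assumes the "stops vanish" convention, i.e. 31 usable DNA codons for NNK and 15 for NNG; the dictionary layout and the function name `library_size` are illustrative choices.

```python
# (usable DNA codons, distinct amino acids) per variegated codon,
# with stop codons discounted ("assuming stops vanish").
PER_CODON = {"NNK": (31, 20), "NNT": (16, 15), "NNG": (15, 13)}


def library_size(design):
    """Return (#DNA sequences, #amino-acid sequences) for a design.

    design maps a variegated codon to the number of positions using it,
    e.g. {"NNT": 4, "NNG": 2} for (NNT)4(NNG)2.
    """
    n_dna = n_aa = 1
    for codon, repeats in design.items():
        dna, aa = PER_CODON[codon]
        n_dna *= dna ** repeats
        n_aa *= aa ** repeats
    return n_dna, n_aa


nnk6 = library_size({"NNK": 6})
for label, design in (("(NNT)4(NNG)2", {"NNT": 4, "NNG": 2}),
                      ("(NNT)6", {"NNT": 6}),
                      ("(NNT)7", {"NNT": 7})):
    dna, aa = library_size(design)
    print(f"{label}: DNA/AA = {dna / aa:.2f}; "
          f"vs (NNK)6: {aa / nnk6[1]:.3f}x AA, {dna / nnk6[0]:.4f}x DNA")
```

The DNA-to-amino-acid ratio printed for each design is the quantity the text identifies as the controlling factor for sampling efficiency.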

Thus, while it is possible to use a simple mixture 
(NNS, NNK or NNN) to obtain at a particular position all 
twenty amino acids, these simple mixtures lead to a highly 
biased set of encoded amino acids. This problem can be 
overcome by use of complexly variegated codons. 
Complexly Variegated Codons 

The nt distribution ("fxS") within the codon that
allows all twenty amino acids and that yields the largest
ratio of abundance of the least favored amino acid (lfaa) to
that of the most favored amino acid (mfaa), subject to the
constraints of equal abundances of acidic and basic amino
acids, least possible number of stop codons, and, for
convenience, the third base being T or G, is shown in Table
10A and yields DNA molecules encoding each type of amino
acid with the abundances shown. Other complexly variegated
codons are obtainable by relaxing one or more constraints.

Note that this chemistry encodes all twenty amino
acids, with acidic and basic amino acids being equiprobable,
and the most favored amino acid (serine) is encoded only
2.454 times as often as the least favored amino acid
(tryptophan). The "fxS" vg codon improves sampling most for
peptides containing several of the amino acids
[F,Y,C,W,H,Q,I,M,N,K,D,E] for which NNK or NNS provide only
one codon. Its sampling advantages are most pronounced when
the library is relatively small.

The results of omitting the requirements of equality of 
acids and bases and minimizing stop codons are shown in 
Table 10B. 

The advantages of an NNT codon are discussed elsewhere
in the present application. Unoptimized NNT provides 15
amino acids encoded by only 16 DNA sequences. It is
possible to improve on NNT with the distribution shown in
Table 10C, which gives five amino acids (SER, LEU, HIS, VAL,
ASP) in very nearly equal amounts. A further eight amino
acids (PHE, TYR, ILE, ASN, PRO, ALA, ARG, GLY) are present
at 78% the abundance of SER. THR and CYS remain at half the
abundance of SER. When variegating DNA for disulfide-bonded
micro-proteins, it is often desirable to reduce the
prevalence of CYS. This distribution allows 13 amino acids
to be seen at high level and gives no stops; the optimized
fxS distribution allows only 11 amino acids at high
prevalence.

The NNG codon can also be optimized. Table 10D shows
an approximately optimized ([ALA] = [ARG]) NNG codon. There
are, under this variegation, four equally most favored amino
acids: LEU, ARG, ALA, and GLU. Note that there is one
acidic and one basic amino acid in this set. There are two
equally least favored amino acids: TRP and MET. The ratio
of lfaa/mfaa is 0.5258. If this codon is repeated six
times, peptides composed entirely of TRP and MET are 2% as
common as peptides composed entirely of the most favored
amino acids. We refer to this as "the prevalence of
(TRP/MET)6 in optimized NNG6 vgDNA".

When synthesizing vgDNA by the "dirty bottle" method,
it is sometimes desirable to use only a limited number of
mixes. One very useful mixture is called the "optimized NNS
mixture", in which we average the first two positions of the
fxS mixture: T1 = 0.24, C1 = 0.17, A1 = 0.33, G1 = 0.26; the
second position is identical to the first; C3 = G3 = 0.5.
This distribution provides the amino acids ARG, SER, LEU,
GLY, VAL, THR, ASN, and LYS at greater than 5% plus ALA,
ASP, GLU, ILE, MET, and TYR at greater than 4%.
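
The amino-acid abundances produced by a nucleotide mixture follow directly from the per-position base fractions. The sketch below assumes the standard genetic code and independent incorporation at each position, and uses the rounded fractions quoted above (0.24, 0.17, 0.33, 0.26 at the first two positions; C and G at 0.5 each at the third). With these rounded inputs ILE, MET, and TYR evaluate to about 3.96%, marginally under the "greater than 4%" quoted, which presumably reflects rounding of the averaged fractions.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Base fractions as quoted in the text (rounded values).
POS12 = {"T": 0.24, "C": 0.17, "A": 0.33, "G": 0.26}
POS3 = {"C": 0.5, "G": 0.5}


def residue_abundance(p1, p2, p3):
    """Return {residue: probability} for one codon built from three mixes.

    '*' stands for a stop codon.
    """
    out = defaultdict(float)
    for (a, pa), (b, pb), (c, pc) in product(p1.items(), p2.items(), p3.items()):
        out[CODON_TABLE[a + b + c]] += pa * pb * pc
    return dict(out)


abundance = residue_abundance(POS12, POS12, POS3)
for residue, p in sorted(abundance.items(), key=lambda kv: -kv[1]):
    print(f"{residue}: {p:.2%}")
```

Passing other dictionaries for the three positions gives the abundances for any other "dirty bottle" mixture.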

An additional complexly variegated codon is of
interest. This codon is identical to the optimized NNT
codon at the first two positions and has T:G::90:10 at the
third position. This codon provides thirteen amino acids
(ALA, ILE, ARG, SER, ASP, LEU, VAL, PHE, ASN, GLY, PRO, TYR,
and HIS) at more than 5.5%. THR at 4.3% and CYS at 3.9% are
more common than the LFAAs of NNK (3.125%). The remaining
five amino acids are present at less than 1%. This codon
has the feature that all amino acids are present; sequences
having more than two of the low-abundance amino acids are
rare. When we isolate an SBD using this codon, we can be
reasonably sure that the first 13 amino acids were tested at
each position. A similar codon, based on optimized NNG,
could be used.

Table 10E shows some properties of an unoptimized NNS
(or NNK) codon. Note that there are three equally most-
favored amino acids: ARG, LEU, and SER. There are also
twelve equally least favored amino acids: PHE, ILE, MET,
TYR, HIS, GLN, ASN, LYS, ASP, GLU, CYS, and TRP. Five amino
acids (PRO, THR, ALA, VAL, GLY) fall in between. Note that
a six-fold repetition of NNS gives sequences composed of the
amino acids [PHE, ILE, MET, TYR, HIS, GLN, ASN, LYS, ASP,
GLU, CYS, and TRP] at only ~0.1% of the sequences composed
of [ARG, LEU, and SER]. Not only is this ~20-fold lower
than the prevalence of (TRP/MET)6 in optimized NNG6 vgDNA,
but this low prevalence applies to twelve amino acids.
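
The two prevalence figures compared above are sixth powers of the per-codon lfaa/mfaa ratio. The short sketch below uses the ratio 0.5258 quoted for the optimized NNG codon and 1/3 for NNS (one codon against three); the function name is illustrative. With unrounded values the NNG figure is roughly 15 times the NNS figure, which is of the same order as the ~20-fold obtained from the rounded 2% and 0.1%.

```python
def prevalence(lfaa_over_mfaa, n_codons=6):
    """Relative abundance of an all-least-favored peptide versus an
    all-most-favored peptide when the same codon is repeated n_codons times."""
    return lfaa_over_mfaa ** n_codons


nng_opt = prevalence(0.5258)   # optimized NNG, ratio quoted in the text
nns = prevalence(1 / 3)        # unoptimized NNS: 1 codon versus 3 codons

print(f"optimized NNG: {nng_opt:.2%}")
print(f"NNS:           {nns:.2%}")
print(f"ratio:         {nng_opt / nns:.1f}-fold")
```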
Diffuse Mutagenesis 

Diffuse Mutagenesis can be applied to any part of the
protein at any time, but is most appropriate when some
binding to the target has been established. Diffuse
Mutagenesis can be accomplished by spiking each of the pure
nts activated for DNA synthesis (e.g., nt-phosphoramidites)
with a small amount of one or more of the other activated
nts. Preferably, the level of spiking is set so that only
a small percentage (1% to 0.00001%, for example) of the
final product will contain the initial DNA sequence. This
will ensure that many single, double, triple, and higher
mutations occur, but that recovery of the basic sequence
will be a possible outcome.
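
The relation between spiking level and the fraction of unmutated product can be sketched as below. This is an illustration under stated assumptions, not a prescribed protocol: each base is taken to be replaced independently with the same probability `spike`, and the region length of 30 bases and the 1% target are hypothetical example values.

```python
from math import comb


def parental_fraction(spike, length):
    """Fraction of product molecules that retain the initial DNA sequence."""
    return (1.0 - spike) ** length


def mutation_count_pmf(spike, length, k):
    """Probability that a molecule carries exactly k base substitutions."""
    return comb(length, k) * spike ** k * (1.0 - spike) ** (length - k)


def spike_for_parental_fraction(target, length):
    """Per-base spiking level that leaves `target` of the product unmutated."""
    return 1.0 - target ** (1.0 / length)


LENGTH = 30      # hypothetical number of diffusely mutagenized bases
TARGET = 0.01    # hypothetical: 1% of product keeps the initial sequence

spike = spike_for_parental_fraction(TARGET, LENGTH)
print(f"per-base spiking level: {spike:.3f}")
for k in range(6):
    print(f"{k} substitutions: {mutation_count_pmf(spike, LENGTH, k):.3f}")
```

The printed distribution shows how single, double, triple, and higher mutants share the product at a given spiking level.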

III.E. Special Considerations Relating to Variegation of
Micro-Proteins with Essential Cysteines

Several of the preferred simple or complex variegated
codons encode a set of amino acids which includes cysteine.
This means that some of the encoded binding domains will
feature one or more cysteines in addition to the invariant
disulfide-bonded cysteines. For example, at each NNT-
encoded position, there is a one in sixteen chance of
obtaining cysteine. If six codons are so varied, the
fraction of domains containing additional cysteines is 0.33.
Odd numbers of cysteines can lead to complications, see
Perry and Wetzel (PERR84). On the other hand, many
disulfide-containing proteins contain cysteines that do not
form disulfides, e.g., trypsin. The possibility of unpaired
cysteines can be dealt with in several ways:

First, the variegated phage population can be passed
over an immobilized reagent that strongly binds free thiols,
such as SulfoLink (catalogue number 44895 H from Pierce
Chemical Company, Rockford, Illinois, 61105). Another
product from Pierce is TNB-Thiol Agarose (Catalogue Code
20409 H). BioRad sells Affi-Gel 401 (catalogue 153-4599)
for this purpose.

Second, one can use a variegation that excludes
cysteines, such as:

NHT that gives [F,S,Y,L,P,H,I,T,N,V,A,D],
VNS that gives [L²,P²,H,Q,R³,I,M,T²,N,K,S,V²,A²,E,D,G²],
NNG that gives [L²,S,W,P,Q,R²,M,T,K,V,A,E,G,Stop],
SNT that gives [L,P,H,R,V,A,D,G],
RNG that gives [M,T,K,R,V,A,E,G],
RMG that gives [T,K,A,E],
VNT that gives [L,P,H,R,I,T,N,S,V,A,D,G], or
RRS that gives [N,S,K,R,D,E,G²].

However, each of these schemes has one or more of the
following disadvantages, relative to NNT: a) fewer amino
acids are allowed, b) amino acids are not evenly provided,
c) acidic and basic amino acids are not equally likely, or
d) stop codons occur. Nonetheless, NNG, NHT, and VNT are
almost as useful as NNT. NNG encodes 13 different amino
acids and one stop signal. Only two amino acids appear
twice in the 16-fold mix.

Thirdly, one can enrich the population for binding to
the preselected target, and evaluate selected sequences post
hoc for extra cysteines. Those that contain more cysteines
than the cysteines provided for conformational constraint
may be perfectly usable. It is possible that a disulfide
linkage other than the designed one will occur. This does
not mean that the binding domain defined by the isolated DNA
sequence is in any way unsuitable. The suitability of the
isolated domains is best determined by chemical and
biochemical evaluation of chemically synthesized peptides.

Lastly, one can block free thiols with reagents, such
as Ellman's reagent, iodoacetate, or methyl iodide, that
specifically bind free thiols and that do not react with
disulfides, and then leave the modified phage in the
population. It is to be understood that the blocking agent
may alter the binding properties of the micro-protein; thus,
one might use a variety of blocking reagents in expectation
that different binding domains will be found. The
variegated population of thiol-blocked genetic packages is
fractionated for binding. If the DNA sequence of the
isolated binding micro-protein contains an odd number of
cysteines, then synthetic means are used to prepare micro-
proteins having each possible linkage and in which the odd
thiol is appropriately blocked. Nishiuchi (NISH82, NISH86,
and works cited therein) discloses methods of synthesizing
peptides that contain a plurality of cysteines so that each
thiol is protected with a different type of blocking group.
These groups can be selectively removed so that the
disulfide pairing can be controlled. We envision using such
a scheme with the alteration that one thiol either remains
blocked, or is unblocked and then reblocked with a different
reagent.
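
The chance of picking up extra cysteines, and the claim that the alternative codons exclude cysteine, can both be checked with a few lines. The sketch assumes the standard genetic code, IUPAC ambiguity letters, and equal representation of every codon in a variegated mix; the function names are illustrative. For six NNT positions the computed fraction is about 0.32, close to the 0.33 quoted above.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def cys_fraction(vg_codon):
    """Fraction of the DNA codons in a variegated codon that encode cysteine."""
    codons = ["".join(c) for c in product(*(IUPAC[ch] for ch in vg_codon))]
    return sum(CODON_TABLE[c] == "C" for c in codons) / len(codons)


def p_extra_cys(vg_codon, n_positions):
    """Probability that at least one of n varied positions encodes cysteine."""
    return 1.0 - (1.0 - cys_fraction(vg_codon)) ** n_positions


print(f"NNT, six positions: {p_extra_cys('NNT', 6):.3f}")
for codon in ("NHT", "VNS", "NNG", "SNT", "RNG", "RMG", "VNT", "RRS"):
    print(codon, "cysteine fraction:", cys_fraction(codon))
```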

III.F. Planning the Second and Later Rounds of Variegation

The method of the present invention allows efficient
accumulation of information concerning the amino-acid
sequence of a binding domain having high affinity for a
predetermined target. Although one may obtain a highly
useful binding domain from a single round of variegation and
affinity enrichment, we expect that multiple rounds will be
needed to achieve the highest possible affinity and
specificity.

If the first round of variegation results in some
binding to the target, but the affinity for the target is
still too low, further improvement may be achieved by
variegation of the SBDs. Preferably, the process is
progressive, i.e., each variegation cycle produces a better
starting point for the next variegation cycle than the
previous cycle produced. Setting the level of variegation
such that the ppbd and many sequences related to the ppbd
sequence are present in detectable amounts ensures that the
process is progressive.

If the level of variegation is so high that the ppbd
sequence is present at such low levels that there is an
appreciable chance that no transformant will display the
PPBD, then the best SBD of the next round could be worse
than the PPBD. At an excessively high level of variegation,
each round of mutagenesis is independent of previous rounds
and there is no assurance of progressivity. This approach
can lead to valuable binding proteins, but repetition of
experiments with this level of variegation will not yield
progressive results. Excessive variation is not preferred.

Progressivity is not an all-or-nothing property. So
long as most of the information obtained from previous
variegation cycles is retained and many different surfaces
that are related to the PPBD surface are produced, the
process is progressive.
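
Whether a given level of variegation keeps the parental sequence in the library can be estimated with a simple calculation. The sketch below is an illustration under assumptions made here: every varied position uses an equimolar codon mix, the parental residue is encoded by `parental_codons` of the `codons_per_position` possibilities, and transformants are independent draws (Poisson approximation). The library size of 10^8 is a hypothetical example.

```python
from math import exp


def p_parent_absent(n_codons, codons_per_position, n_its, parental_codons=1):
    """Probability that no transformant displays the parental (PPBD) sequence."""
    parental_share = (parental_codons / codons_per_position) ** n_codons
    return exp(-n_its * parental_share)


N_ITS = 1e8   # hypothetical number of independent transformants
for n in (5, 6, 7, 8):
    print(f"{n} NNT codons varied: "
          f"P(parental sequence absent) = {p_parent_absent(n, 16, N_ITS):.3f}")
```

In this example the parental sequence is almost certainly present when six NNT codons are varied and almost certainly absent when eight are varied, which is the transition from progressive to non-progressive variegation described above.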

If the level of variegation in the previous variegation
cycle was correctly chosen, then the amino acids selected to
be in the residues just varied are the ones best determined.
The environment of other residues has changed, so that it is
appropriate to vary them again. Because there are often
more residues of interest than can be varied simultaneously,
we may continue by picking residues that either have never
been varied (highest priority) or that have not been varied
for one or more cycles.

Use of NNT or NNG variegated codons leads to very
efficient sampling of variegated libraries because the ratio
of (different amino-acid sequences)/(different DNA
sequences) is much closer to unity than it is for NNK or
even the optimized vg codon (fxS). Nevertheless, a few amino
acids are omitted in each case. Both NNT and NNG allow
members of all important classes of amino acids:
hydrophobic, hydrophilic, acidic, basic, neutral
hydrophilic, small, and large. After selecting a binding
domain, a subsequent variegation and selection may be
desirable to achieve a higher affinity or specificity.
During this second variegation, amino acid possibilities
overlooked by the preceding variegation may be investigated.

A few examples may be helpful. Suppose we obtained PRO
using NNT. This amino acid is available with either NNT or
NNG. We can be reasonably sure that PRO is the best amino
acid from the set [PRO, LEU, VAL, THR, ALA, ARG, GLY, PHE,
TYR, CYS, HIS, ILE, ASN, ASP, SER]. We next might try a set
that includes [PRO, TRP, GLN, MET, LYS, GLU]. The set
allowed by NNG is the preferred set.

What if we obtained HIS instead? Histidine is aromatic
and fairly hydrophobic and can form hydrogen bonds to and 
from the imidazole ring. Tryptophan is hydrophobic and 
aromatic and can donate a hydrogen to a suitable acceptor 
and was excluded by the NNT codon. Methionine was also 
excluded and is hydrophobic. Thus, one preferred course is 
to use the variegated codon HDS that allows [HIS, GLN, ASN, 
LYS, TYR, CYS, TRP, ARG, SER, GLY, <stop>] . 

If the first round of variegation is entirely
unsuccessful, a different pattern of variegation should be
used. For example, if more than one interaction set can be
defined within a domain, the residues varied in the next
round of variegation should be from a different set than
that probed in the initial variegation. If repeated
failures are encountered, one may switch to a different
IPBD.

IV. DISPLAY STRATEGY: DISPLAYING FOREIGN BINDING DOMAINS ON
THE SURFACE OF A "GENETIC PACKAGE"

IV.A. General Requirements for Genetic Packages

In order to obtain the display of a multitude of
different though related potential binding domains,
applicants generate a heterogeneous population of replicable
genetic packages each of which comprises a hybrid gene
including a first DNA sequence which encodes a potential
binding domain for the target of interest and a second DNA
sequence which encodes a display means, such as an outer
surface protein native to the genetic package but not
natively associated with the potential binding domain (or
the parental binding domain to which it is related) which
causes the genetic package to display the corresponding
chimeric protein (or a processed form thereof) on its outer
surface.

The component of a population that exhibits the desired
binding properties may be quite small, for example, one in
10^6 or less. Once this component of the population is
separated from the non-binding components, it must be
possible to amplify it. Culturing viable cells is the most
powerful amplification of genetic material known and is
preferred. Genetic messages can also be amplified in vitro,
e.g., by PCR, but this is not the most preferred method.

Preferably, the GP can be: 1) genetically altered with
reasonable facility to encode a potential binding domain, 2)
maintained and amplified in culture, 3) manipulated to
display the potential binding protein domain where it can
interact with the target material during affinity
separation, and 4) affinity separated while retaining the
genetic information encoding the displayed binding domain in
recoverable form. Preferably, the GP remains viable after
affinity separation. Preferred GPs are vegetative bacterial
cells, bacterial spores and, especially, bacterial DNA
viruses. Eukaryotic cells and eukaryotic viruses may be
used as genetic packages, but are not preferred.

When the genetic package is a bacterial cell, or a
phage which is assembled periplasmically, the display means
has two components. The first component is a secretion
signal which directs the initial expression product to the
inner membrane of the cell (a host cell when the package is
a phage). This secretion signal is cleaved off by a signal
peptidase to yield a processed, mature, potential binding
protein. The second component is an outer surface transport
signal which directs the package to assemble the processed
protein into its outer surface. Preferably, this outer
surface transport signal is derived from a surface protein
native to the genetic package.

For example, in a preferred embodiment, the hybrid gene
comprises a DNA encoding a potential binding domain operably
linked to a signal sequence (e.g., the signal sequences of
the bacterial phoA or bla genes or the signal sequence of
M13 phage gene III) and to DNA encoding a coat protein
(e.g., the M13 gene III or gene VIII proteins) of a
filamentous phage (e.g., M13). The expression product is
transported to the inner membrane (lipid bilayer) of the
host cell, whereupon the signal peptide is cleaved off to
leave a processed hybrid protein. The C-terminus of the
coat protein-like component of this hybrid protein is
trapped in the lipid bilayer, so that the hybrid protein
does not escape into the periplasmic space. (This is
typical of the wild-type coat protein.) As the single-
stranded DNA of the nascent phage particle passes into the
periplasmic space, it collects both wild-type coat protein
and the hybrid protein from the lipid bilayer. The hybrid
protein is thus packaged into the surface sheath of the
filamentous phage, leaving the potential binding domain
exposed on its outer surface. (Thus, the filamentous phage,
not the host bacterial cell, is the "replicable genetic
package" in this embodiment.)

If a secretion signal is necessary for the display of 
the potential binding domain, in an especially preferred 
embodiment the bacterial cell in which the hybrid gene is 
expressed is of a "secretion-permissive" strain. 

When the genetic package is a bacterial spore, or a
phage (such as ΦX174 or λ) whose coat is assembled
intracellularly, a secretion signal directing the expression
product to the inner membrane of the host bacterial cell is
unnecessary. In these cases, the display means is merely
the outer surface transport signal, typically a derivative
of a spore or phage coat protein.

Preferred OSPs for several GPs are given in Table 2.
References to osp-ipbd fusions in this section should be
taken to apply, mutatis mutandis, to osp-pbd and osp-sbd
fusions as well.

IV.B. Phages for Use as GPs:

Periplasmically assembled phage are preferred when the
IPBD is a disulfide-bonded micro-protein, as such IPBDs may
not fold within a cell (these proteins may fold after the
phage is released from the cell). Intracellularly assembled
phage are preferred when the IPBD needs large or insoluble
prosthetic groups (such as Fe4S4 clusters), since the IPBD
may not fold if secreted because the prosthetic group is
lacking in the periplasm.

When variegation is introduced, multiple infections
could generate hybrid GPs that carry the gene for one PBD
but have at least some copies of a different PBD on their
surfaces; it is preferable to minimize this possibility by
infecting cells with phage under conditions resulting in a
low multiplicity-of-infection (MOI).

Bacteriophages are excellent candidates for GPs because 
there is little or no enzymatic activity associated with 
intact mature phage, and because the genes are inactive 
outside a bacterial host, rendering the mature phage 
particles metabolically inert. 

For a given bacteriophage, the preferred OSP is usually
one that is present on the phage surface in the largest
number of copies. Nevertheless, an OSP such as M13 gIII
protein (5 copies/phage) may be an excellent choice as OSP
to cause display of the PBD.

It is preferred that the wild-type osp gene be
preserved. The ipbd gene fragment may be inserted either
into a second copy of the recipient osp gene or into a novel
engineered osp gene. It is preferred that the osp-ipbd gene
be placed under control of a regulated promoter.

The user must choose a site in the candidate OSP gene
for inserting an ipbd gene fragment. The coats of most
bacteriophage are highly ordered. In such bacteriophage, it
is important to retain in engineered OSP-IPBD fusion
proteins those residues of the parental OSP that interact
with other proteins in the virion. For M13 gVIII, we
preferably retain the entire mature protein, while for M13
gIII, it might suffice to retain the last 100 residues
(BASS90) (or even fewer). Such a truncated gIII protein
would be expressed in parallel with the complete gIII
protein, as gIII protein is required for phage infectivity.

The filamentous phage, which include M13, f1, fd, If1,
Ike, Xf, Pf1, and Pf3, are of particular interest. The
major coat protein is encoded by gene VIII. The 50 amino
acid mature gene VIII coat protein is synthesized as a 73
amino acid precoat (ITOK79). The first 23 amino acids
constitute a typical signal-sequence which causes the
nascent polypeptide to be inserted into the inner cell
membrane.

An E. coli signal peptidase (SP-I) recognizes amino
acids 18, 21, and 23, and, to a lesser extent, residue 22,
and cuts between residues 23 and 24 of the precoat (KUHN85a,
KUHN85b, OLIV87). After removal of the signal sequence, the
amino terminus of the mature coat is located on the
periplasmic side of the inner membrane; the carboxy terminus
is on the cytoplasmic side. About 3000 copies of the mature
50 amino acid coat protein associate side-by-side in the
inner membrane.

The sequence of gene VIII is known, and the amino acid
sequence can be encoded on a synthetic gene, using the
lacUV5 promoter and used in conjunction with the LacIq
repressor. The lacUV5 promoter is induced by IPTG. Mature
gene VIII protein makes up the sheath around the circular
ssDNA. The 3D structure of f1 virion is known at medium
resolution; the amino terminus of gene VIII protein is on
the surface of the virion and is therefore a preferred
attachment site for the potential binding domain. A few
modifications of gene VIII have been made and are discussed
below. The 2D structure of M13 coat protein is implicit in
the 3D structure. Mature M13 gene VIII protein has only one
domain.

We have constructed a tripartite gene comprising:

1) DNA encoding a signal sequence directing secretion of
   parts (2) and (3) through the inner membrane,

2) DNA encoding the mature BPTI sequence, and

3) DNA encoding the mature M13 gVIII protein.

This gene causes BPTI to appear in active form on the
surface of M13 phage.

The amino-acid sequence of M13 pre-coat (SCHA78),
called AA_seq1, is

AA_seq1

         1    1    2  ||2    3    3    4    4    5
    5    0    5    0  \/5    0    5    0    5    0
MKKSLVLKASVAVATLVPMLSFAAEGDDPAKAAFNSLQASATEYIGYAWA

    5    6    6    7  7
    5    0    5    0  3
MVVVIVGATIGIKLFKKFTSKAS



The best site for inserting a novel protein domain into M13
CP is after A23 because SP-I cleaves the precoat protein
after A23, as indicated by the arrow. Proteins that can be
secreted will appear connected to mature M13 CP at its amino
terminus. Because the amino terminus of mature M13 CP is
located on the outer surface of the virion, the introduced
domain will be displayed on the outside of the virion. The
uncertainty of the mechanism by which M13 CP appears in the
lipid bilayer raises the possibility that direct insertion
of bpti into gene VIII may not yield a functional fusion
protein. It may be necessary to change the signal sequence
of the fusion to, for example, the phoA signal sequence
(MKQSTIALALLPLLFTPVTKA) (MARX91). Marks et al.
(MARK86) showed that the phoA signal peptide could direct
mature BPTI to the E. coli periplasm.
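
The insertion logic described above (signal sequence, then the inserted domain, then the mature coat protein, with cleavage after residue 23) can be illustrated at the amino-acid level. This is a sketch only: the precoat string is the 73-residue sequence given in the text with the residues 51 to 56 read as MVVVIV (an assumption made to restore the stated length), `PLACEHOLDER_INSERT` is a hypothetical stand-in and not a real binding domain, and signal-peptidase processing is modelled simply as removal of a fixed-length prefix.

```python
# 73-residue M13 gene VIII precoat as given in the text
# (assumption: OCR "MVW" read as "MVVV" to restore the stated length of 73).
PRECOAT = ("MKKSLVLKASVAVATLVPMLSFA"                                # signal, 1-23
           "AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKAS")   # mature, 24-73
SIGNAL_LEN = 23
PHOA_SIGNAL = "MKQSTIALALLPLLFTPVTKA"   # alternative signal quoted in the text

PLACEHOLDER_INSERT = "GSGSGS"           # hypothetical stand-in for a binding domain


def build_fusion(insert, signal=None):
    """Return signal + insert + mature coat protein as one polypeptide."""
    sig = PRECOAT[:SIGNAL_LEN] if signal is None else signal
    return sig + insert + PRECOAT[SIGNAL_LEN:]


def mature_product(fusion, signal_len):
    """Model signal-peptidase cleavage as removal of the signal prefix."""
    return fusion[signal_len:]


native = build_fusion(PLACEHOLDER_INSERT)
with_phoa = build_fusion(PLACEHOLDER_INSERT, signal=PHOA_SIGNAL)

print(mature_product(native, SIGNAL_LEN))
print(mature_product(with_phoa, len(PHOA_SIGNAL)))
```

In both cases the processed product begins with the inserted domain followed by the 50-residue mature coat protein, so the insert sits at the amino terminus that is exposed on the virion surface.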

Another vehicle for displaying the IPBD is by
expressing it as a domain of a chimeric gene containing part
or all of gene III. This gene encodes one of the minor coat
proteins of M13. Genes VI, VII, and IX also encode minor
coat proteins. Each of these minor proteins is present in
about 5 copies per virion and is related to morphogenesis or
infection. In contrast, the major coat protein is present
in more than 2500 copies per virion. The gene VI, VII, and
IX proteins are present at the ends of the virion; these
three proteins are not post-translationally processed
(RASC86).

The single-stranded circular phage DNA associates with
about five copies of the gene III protein and is then
extruded through the patch of membrane-associated coat
protein in such a way that the DNA is encased in a helical
sheath of protein (WEBS78). The DNA does not base pair
(that would impose severe restrictions on the virus genome);
rather the bases intercalate with each other independent of
sequence.

Smith (SMIT85) and de la Cruz et al. (DELA88) have
shown that insertions into gene III cause novel protein
domains to appear on the virion outer surface. The
mini-protein's gene may be fused to gene III at the site used by
Smith and by de la Cruz et al., at a codon corresponding to
another domain boundary or to a surface loop of the protein,
or to the amino terminus of the mature protein.

All published works use a vector containing a single
modified gene III of fd. Thus, all five copies of gIII are
identically modified. Gene III is quite large (1272 b.p. or
about 20% of the phage genome) and it is uncertain whether
a duplicate of the whole gene can be stably inserted into
the phage. Furthermore, all five copies of gIII protein are
at one end of the virion. When bivalent target molecules
(such as antibodies) bind a pentavalent phage, the resulting
complex may be irreversible. Irreversible binding of the GP
to the target greatly interferes with affinity enrichment of
the GPs that carry the genetic sequences encoding the novel
polypeptide having the highest affinity for the target.

To reduce the likelihood of formation of irreversible
complexes, we may use a second, synthetic gene that encodes
carboxy-terminal parts of gIII; the carboxy-terminal parts of
the gene III protein cause it to assemble into the phage.
For example, the final 29 residues (starting with the
arginine specified by codon 398) may be enough to cause a
fusion protein to assemble into the phage. Alternatively,
one might include the final globular domain of mature gIII
protein, viz. the final 150 to 160 amino acids of gene III
(BASS90). We might, for example, engineer a gene that
consists of (from 5' to 3'):

1) a promoter (preferably regulated),
2) a ribosome-binding site,
3) an initiation codon,
4) a functional signal peptide directing secretion of
   parts (5) and (6) through the inner membrane,
5) DNA encoding an IPBD,
6) DNA encoding residues 275 through 424 of M13 gIII
   protein,
7) a translation stop codon, and
8) (optionally) a transcription stop signal.

We leave the wild-type gene III in place so that some unaltered gene
III protein will be present. Alternatively, we may use gene
VIII protein as the OSP and regulate the osp::ipbd fusion so
that only one or a few copies of the fusion protein appear
on the phage.
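
Part (6) of the gene outlined above is specified as a residue range, which must be converted to nucleotide coordinates when the fragment is cut out of gene III. The sketch below shows that arithmetic. It assumes residues are numbered from the first codon of the gene III coding region (consistent with the 1272 b.p. length quoted earlier, since 424 codons x 3 = 1272); the function name is hypothetical.

```python
def residue_range_to_nt(first_res, last_res, cds_start=0):
    """Convert a 1-based, inclusive residue range to a 0-based,
    half-open nucleotide slice (start, end) within the coding region.

    cds_start is the offset of the first coding nucleotide and is an
    assumption of this sketch (0 means the sequence begins at codon 1).
    """
    if first_res < 1 or last_res < first_res:
        raise ValueError("invalid residue range")
    start = cds_start + 3 * (first_res - 1)
    end = cds_start + 3 * last_res
    return start, end


start, end = residue_range_to_nt(275, 424)
n_codons = (end - start) // 3
print(start, end, n_codons)
```

Under this numbering the fragment spans 150 codons and ends at nucleotide 1272, that is, at the last codon of gene III, in line with "the final 150 to 160 amino acids" mentioned above.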

M13 gene VI, VII, and IX proteins are not processed
after translation. The route by which these proteins are
assembled into the phage has not been reported. These
proteins are necessary for normal morphogenesis and
infectivity of the phage. Whether these molecules (gene VI
protein, gene VII protein, and gene IX protein) attach
themselves to the phage: a) from the cytoplasm, b) from the
periplasm, or c) from within the lipid bilayer, is not
known. One could use any of these proteins to introduce an
IPBD onto the phage surface by one of the constructions:
1) ipbd::pmcp,
2) pmcp::ipbd,
3) signal::ipbd::pmcp, and
4) signal::pmcp::ipbd,

where ipbd represents DNA coding on expression for the
initial potential binding domain; pmcp represents DNA coding
for one of the phage minor coat proteins, VI, VII, and IX;
signal represents a functional secretion signal peptide,
such as the phoA signal (MKQSTIALALLPLLFTPVTKA); and "::"
represents in-frame genetic fusion. The indicated fusions
are placed downstream of a known promoter, preferably a
regulated promoter such as lacUV5, tac, or trp. Fusions (1)
and (2) are appropriate when the minor coat protein attaches
to the phage from the cytoplasm or by autonomous insertion
into the lipid bilayer. Fusion (1) is appropriate if the
amino terminus of the minor coat protein is free and (2) is
appropriate if the carboxy terminus is free. Fusions (3)
and (4) are appropriate if the minor coat protein attaches
to the phage from the periplasm or from within the lipid
bilayer. Fusion (3) is appropriate if the amino terminus of
the minor coat protein is free and (4) is appropriate if the
carboxy terminus is free.
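
The choice among constructions (1) through (4) depends on two facts about the minor coat protein: the route by which it reaches the phage and which terminus is free. The rule stated above can be written as a small lookup. This is a sketch only; the route labels and the function name are hypothetical names chosen for the example.

```python
# Routes for which no secretion signal is added (fusions 1 and 2).
UNSIGNALLED_ROUTES = {"cytoplasm", "autonomous_insertion"}
# Routes for which a secretion signal is added (fusions 3 and 4).
SIGNALLED_ROUTES = {"periplasm", "within_bilayer"}


def choose_fusion(route, free_terminus):
    """Return the fusion layout, using '::' for in-frame fusion."""
    if free_terminus not in ("amino", "carboxy"):
        raise ValueError("free_terminus must be 'amino' or 'carboxy'")
    if route not in UNSIGNALLED_ROUTES | SIGNALLED_ROUTES:
        raise ValueError("unknown route: " + route)

    # The IPBD is placed at whichever terminus of the coat protein is free.
    parts = ["ipbd", "pmcp"] if free_terminus == "amino" else ["pmcp", "ipbd"]
    if route in SIGNALLED_ROUTES:
        parts.insert(0, "signal")
    return "::".join(parts)


for route in ("cytoplasm", "periplasm"):
    for terminus in ("amino", "carboxy"):
        print(route, terminus, choose_fusion(route, terminus))
```

Because the attachment route of the gene VI, VII, and IX proteins is not known, all four outputs remain candidates in practice; the lookup only makes the stated rule explicit.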

Similar constructions could be made with other
filamentous phage. Pf3 is a well known filamentous phage
that infects Pseudomonas aeruginosa cells that harbor an
IncP-1 plasmid. The major coat protein of Pf3 is unusual in
having no signal peptide to direct its secretion. The
sequence has charged residues ASP7, ARG37, LYS40, and PHE44-COO-,
which is consistent with the amino terminus being exposed.
Thus, to cause an IPBD to appear on the surface of Pf3, we
construct a tripartite gene comprising:
1) a signal sequence known to cause secretion in P.
   aeruginosa (preferably known to cause secretion of
   IPBD) fused in-frame to,
2) a gene fragment encoding the IPBD sequence, fused
   in-frame to,
3) DNA encoding the mature Pf3 coat protein.

Optionally, DNA encoding a flexible linker of one to 10 
amino acids and/or amino acids forming a recognition site 
for a specific protease (e.g., Factor Xa) is introduced 
between the ipbd gene fragment and the Pf3 coat-protein
gene. This tripartite gene is introduced into Pf3 so that
it does not interfere with expression of any Pf3 genes. To
reduce the possibility of genetic recombination, part (3) is
designed to have numerous silent mutations relative to the
wild-type gene. Once the signal sequence is cleaved off,
the IPBD is in the periplasm and the mature coat protein
acts as an anchor and phage-assembly signal. It does not
matter that this fusion protein comes to rest in the lipid
bilayer by a route different from the route followed by the
wild-type coat protein.

As described in WO90/02809, other phage, such as
bacteriophage ΦX174, large DNA phage such as λ or T4, and
even RNA phage, may with suitable adaptations and
modifications be used as GPs.

Bacterial Cells as Genetic Packages:

One may choose any well-characterized bacterial strain
which (1) may be grown in culture, (2) may be engineered to
display PBDs on its surface, and (3) is compatible with
affinity selection.

Among bacterial cells, the preferred genetic packages
are Salmonella typhimurium, Bacillus subtilis, Pseudomonas
aeruginosa, Vibrio cholerae, Klebsiella pneumoniae, Neisseria
gonorrhoeae, Neisseria meningitidis, Bacteroides nodosus,
Moraxella bovis, and especially Escherichia coli. The
potential binding mini-protein may be expressed as an insert
in a chimeric bacterial outer surface protein (OSP). All
bacteria exhibit proteins on their outer surfaces. E. coli
is the preferred bacterial GP and, for it, LamB is a
preferred OSP.

While most bacterial proteins remain in the cytoplasm,
others are transported to the periplasmic space (which lies
between the plasma membrane and the cell wall of gram-negative
bacteria), or are conveyed and anchored to the
outer surface of the cell. Still others are exported
(secreted) into the medium surrounding the cell. Those
characteristics of a protein that are recognized by a cell
and that cause it to be transported out of the cytoplasm and
displayed on the cell surface will be termed "outer-surface
transport signals".

Gram-negative bacteria have outer-membrane proteins
(OMPs), which form a subset of OSPs. Many OMPs span the
membrane one or more times. The signals that cause OMPs to
localize in the outer membrane are encoded in the amino acid
sequence of the mature protein. Outer membrane proteins of
bacteria are initially expressed in a precursor form
including a so-called signal peptide. The precursor protein
is transported to the inner membrane, and the signal peptide
moiety is extruded into the periplasmic space. There, it is
cleaved off by a "signal peptidase", and the remaining
"mature" protein can now enter the periplasm. Once there,
other cellular mechanisms recognize structures in the mature
protein which indicate that its proper place is on the outer
membrane, and transport it to that location.

It is well known that the DNA coding for the leader or
signal peptide from one protein may be attached to the DNA
sequence coding for another protein, protein X, to form a
chimeric gene whose expression causes protein X to appear
free in the periplasm. The use of export-permissive
bacterial strains (LISS85, STAD89) increases the probability
that a signal-sequence fusion will direct the desired
protein to the cell surface.

OSP-IPBD fusion proteins need not fill a structural
role in the outer membranes of Gram-negative bacteria
because parts of the outer membranes are not highly ordered.
For large OSPs there are likely to be one or more sites at
which osp can be truncated and fused to ipbd such that cells
expressing the fusion will display IPBDs on the cell
surface. Fusions of fragments of omp genes with fragments
of an x gene have led to X appearing on the outer membrane
(CHAR88b,c, BENS84, CLEM81). When such fusions have been
made, we can design an osp-ipbd gene by substituting ipbd
for x in the DNA sequence. Otherwise, a successful OMP-IPBD
fusion is preferably sought by fusing fragments of the best
omp to an ipbd, expressing the fused gene, and testing the
resultant GPs for the display-of-IPBD phenotype. We use the
available data about the OMP to pick the point or points of
fusion between omp and ipbd to maximize the likelihood that
IPBD will be displayed. (Spacer DNA encoding flexible
linkers, made, e.g., of GLY, SER, and ASN, may be placed
between the osp- and ipbd-derived fragments to facilitate
display.) Alternatively, we truncate osp at several sites
or in a manner that produces osp fragments of variable
length and fuse the osp fragments to ipbd; cells expressing
the fusion are then screened or selected for display of IPBDs
on the cell surface. Freudl et al. (FREU89) have shown that
fragments of OSPs (such as OmpA) above a certain size are
incorporated into the outer membrane. An additional
alternative is to include short segments of random DNA in
the fusion of omp fragments to ipbd and then screen or
select the resulting variegated population for members
exhibiting the display-of-IPBD phenotype.

In E. coli, the LamB protein is a well understood OSP
and can be used. The E. coli LamB has been expressed in
functional form in S. typhimurium, V. cholerae, and K.
pneumoniae, so that one could display a population of PBDs in any
of these species as a fusion to E. coli LamB. K. pneumoniae
expresses a maltoporin similar to LamB (WEHM89) which could
also be used. In P. aeruginosa, the D1 protein (a
homologue of LamB) can be used (TRIA88).

LamB is transported to the outer membrane if a
functional N-terminal sequence is present; further, the
first 49 amino acids of the mature sequence are required for
successful transport (BENS84). As with other OSPs, LamB of
E. coli is synthesized with a typical signal sequence which
is subsequently removed. Homology between parts of LamB
protein and other outer membrane proteins OmpC, OmpF, and
PhoE has been detected (NIKA84), including homology between
LamB amino acids 39-49 and sequences of the other proteins.
These subsequences may label the proteins for transport to
the outer membrane.

The amino acid sequence of LamB is known (CLEM81), and
a model has been developed of how it anchors itself to the
outer membrane (reviewed by, among others, BENZ88b). The
locations of its maltose and phage binding domains are also
known (HEIN88). Using this information, one may identify
several strategies by which a PBD insert may be incorporated
into LamB to provide a chimeric OSP which displays the PBD
on the bacterial outer membrane.

When the PBDs are to be displayed by a chimeric transmembrane
protein like LamB, the PBD could be inserted into
a loop normally found on the surface of the cell (cf.
BECK83, MANO86). Alternatively, we may fuse a 5' segment of
the osp gene to the ipbd gene fragment; the point of fusion
is picked to correspond to a surface-exposed loop of the OSP
and the carboxy-terminal portions of the OSP are omitted.
In LamB, it has been found that up to 60 amino acids may be
inserted (CHAR88b,c) with display of the foreign epitope
resulting; the structural features of OmpC, OmpA, OmpF, and
PhoE are so similar that one expects similar behavior from
these proteins.

It should be noted that while LamB may be characterized
as a binding protein, it is used in the present invention to
provide an OSTS; its binding domains are not variegated.

Other bacterial outer surface proteins, such as OmpA,
OmpC, OmpF, PhoE, and pilin, may be used in place of LamB
and its homologues. OmpA is of particular interest because
it is very abundant and because homologues are known in a
wide variety of gram-negative bacterial species. Baker et
al. (BAKE87) review assembly of proteins into the outer
membrane of E. coli and cite a topological model of OmpA
(VOGE86) that predicts that residues 19-32, 62-73, 105-118,
and 147-158 are exposed on the cell surface. Insertion of
an ipbd-encoding fragment at about codon 111 or at about
codon 152 is likely to cause the IPBD to be displayed on the
cell surface. Concerning OmpA, see also MACI88 and MANO88.
Porin Protein F of Pseudomonas aeruginosa has been cloned
and has sequence homology to OmpA of E. coli (DUCH88).
Although this homology is not sufficient to allow prediction
of surface-exposed residues on Porin Protein F, the methods
used to determine the topological model of OmpA may be
applied to Porin Protein F. Works related to use of OmpA as
an OSP include BECK80 and MACI88.

Misra and Benson (MISR88a, MISR88b) disclose a
topological model of E. coli OmpC that predicts that, among
others, residues GLY164 and LEU250 are exposed on the cell
surface. Thus insertion of an ipbd gene fragment at about
codon 164 or at about codon 250 of the E. coli ompC gene or
at corresponding codons of the S. typhimurium ompC gene is
likely to cause IPBD to appear on the cell surface. The
ompC genes of other bacterial species may be used. Other
works related to OmpC include CATR87 and CLIC88.

OmpF of E. coli is a very abundant OSP, ≥10^4 copies/
cell. Pages et al. (PAGE90) have published a model of OmpF
indicating seven surface-exposed segments. Fusion of an
ipbd gene fragment, either as an insert or to replace the 3'
part of ompF, in one of the indicated regions is likely to
produce a functional ompF::ipbd gene the expression of which
leads to display of IPBD on the cell surface. In
particular, fusion at about codon 111, 177, 217, or 245
should lead to a functional ompF::ipbd gene. Concerning
OmpF, see also REID88b, PAGE88, BENS88, TOMM82, and SODE85.

Pilus proteins are of particular interest because
piliated cells express many copies of these proteins and
because several species (N. gonorrhoeae, P. aeruginosa,
Moraxella bovis, Bacteroides nodosus, and E. coli) express
related pilins. Getzoff and coworkers (GETZ88, PARG87,
SOME85) have constructed a model of the gonococcal pilus
that predicts that the protein forms a four-helix bundle
having structural similarities to tobacco mosaic virus
protein and myohemerythrin. On this model, both the amino
and carboxy termini of the protein are exposed. The amino
terminus is methylated. Elleman (ELLE88) has reviewed
pilins of Bacteroides nodosus and other species, noting that
serotype differences can be related to differences in the pilin
protein and that most variation occurs in the C-terminal
region. The amino-terminal portions of the pilin protein
are highly conserved. Jennings et al. (JENN89) have grafted
a fragment of foot-and-mouth disease virus (residues 144-159)
into the B. nodosus type 4 fimbrial protein, which is
highly homologous to gonococcal pilin. They found that
expression of the 3'-terminal fusion in P. aeruginosa led to
a viable strain that makes detectable amounts of the fusion
protein. Jennings et al. did not vary the foreign epitope,
nor did they suggest any variation. They inserted a GLY-GLY
linker between the last pilin residue and the first residue
of the foreign epitope to provide a "flexible linker". Thus
a preferred place to attach an IPBD is the carboxy terminus.
The exposed loops of the bundle could also be used, although
the particular internal fusions tested by Jennings et al.
(JENN89) appeared to be lethal in P. aeruginosa. Concerning
pilin, see also MCKE85 and ORND85.

Judd (JUDD86, JUDD85) has investigated Protein IA of N.
gonorrhoeae and found that the amino terminus is exposed;
thus, one could attach an IPBD at or near the amino terminus
of the mature P.IA as a means to display the IPBD on the N.
gonorrhoeae surface.

A model of the topology of PhoE of E. coli has been
disclosed by van der Ley et al. (VAND86). This model
predicts eight loops that are exposed; insertion of an IPBD
into one of these loops is likely to lead to display of the
IPBD on the surface of the cell. Residues 158, 201, 238,
and 275 are preferred locations for insertion of an IPBD.

Other OSPs that could be used include E. coli BtuB,
FepA, FhuA, IutA, FecA, and FhuE (GUDM89), which are
receptors for nutrients usually found in low abundance. The
genes of all these proteins have been sequenced, but
topological models are not yet available. Gudmundsdottir et
al. (GUDM89) have begun the construction of such a model for
BtuB and FepA by showing that certain residues of BtuB face
the periplasm and by determining the functionality of
various BtuB::FepA fusions. Carmel et al. (CARM90) have
reported work of a similar nature for FhuA. All Neisseria
species express outer surface proteins for iron transport
that have been identified and, in many cases, cloned. See
also MORS87 and MORS88.

Many gram-negative bacteria express one or more
phospholipases. E. coli phospholipase A, product of the
pldA gene, has been cloned and sequenced by de Geus et al.
(DEGE84). They found that the protein appears at the cell
surface without any posttranslational processing. An ipbd
gene fragment can be attached at either terminus or inserted
at positions predicted to encode loops in the protein. That
phospholipase A arrives on the outer surface without removal
of a signal sequence does not prove that a PldA::IPBD fusion
protein will also follow this route. Thus we might cause a
PldA::IPBD or IPBD::PldA fusion to be secreted into the
periplasm by addition of an appropriate signal sequence.
Thus, in addition to simple binary fusion of an ipbd
fragment to one terminus of pldA, the constructions:
1) ss::ipbd::pldA

should be tested. Once the PldA::IPBD protein is free in
the periplasm it does not remember how it got there and the
structural features of PldA that cause it to localize on the
outer surface will direct the fusion to the same
destination.

Bacterial Spores as Genetic Packages:

Bacterial spores have desirable properties as GP candidates.
Spores are much more resistant than vegetative
bacterial cells or phage to chemical and physical agents,
and hence permit the use of a great variety of affinity
selection conditions. Also, Bacillus spores neither
actively metabolize nor alter the proteins on their surface.
Bacillus spores, and more especially B. subtilis spores, are
therefore the preferred sporoidal GPs. As discussed more
fully in WO90/02809, a foreign binding domain may be
introduced into an outer surface protein such as that
encoded by the B. subtilis cotC or cotD genes.

It is generally preferable to use as the genetic 
package a cell, spore or virus for which an outer surface 
protein which can be engineered to display an IPBD has
already been identified. However, as explained in 
WO90/02809, the present invention is not limited to such 
genetic packages, as an outer surface transport signal may 
be generated by variegation-and-selection techniques. 
V.E Genetic Construction and Expression Considerations 

The ipbd-osp gene may be: a) completely synthetic, b)
a composite of natural and synthetic DNA, or c) a composite
of natural DNA fragments. The important point is that the
pbd segment be easily variegated so as to encode a
multitudinous and diverse family of PBDs as previously
described. A synthetic ipbd segment is preferred because it
allows greatest control over placement of restriction sites.
Primers complementary to regions abutting the osp-ipbd gene
on its 3' flank and to parts of the osp-ipbd gene that are
not to be varied are needed for sequencing.

The sequences of regulatory parts of the gene are taken
from the sequences of natural regulatory elements: a)
promoters, b) Shine-Dalgarno sequences, and c) transcriptional
terminators. Regulatory elements could also be
designed from knowledge of consensus sequences of natural
regulatory regions. The sequences of these regulatory
elements are connected to the coding regions; restriction
sites are also inserted in or adjacent to the regulatory
regions to allow convenient manipulation.

The essential function of the affinity separation is to
separate GPs that bear PBDs (derived from IPBD) having high
affinity for the target from GPs bearing PBDs having low
affinity for the target. If the elution volume of a GP
depends on the number of PBDs on the GP surface, then a GP
bearing many PBDs with low affinity, GP(PBDw), might co-elute
with a GP bearing fewer PBDs with high affinity,
GP(PBDs). Regulation of the osp-ipbd gene preferably is such
that most packages display sufficient PBD to effect a good
separation according to affinity. Use of a regulatable
promoter to control the level of expression of the osp-ipbd
gene allows fine adjustment of the chromatographic behavior of
the variegated population.

Induction of synthesis of engineered genes in
vegetative bacterial cells has been exercised through the
use of regulated promoters such as lacUV5, trpP, or tac
(MANI82). The factors that regulate the quantity of protein
synthesized are sufficiently well understood that a wide
variety of heterologous proteins can now be produced in E.
coli, B. subtilis, and other host cells in at least moderate
quantities (BETT88). Preferably, the promoter for the osp-ipbd
gene is subject to regulation by a small chemical
inducer. For example, the lac promoter and the hybrid trp-lac
(tac) promoter are regulatable with isopropyl
thiogalactoside (IPTG). The promoter for the constructed
gene need not come from a natural osp gene; any regulatable
bacterial promoter can be used. A non-leaky promoter is
preferred.

The present invention is not limited to a single method
of gene design. The osp-ipbd gene need not be synthesized
in toto; parts of the gene may be obtained from nature.
One may use any genetic engineering method to produce the
correct gene fusion, so long as one can easily and
accurately direct mutations to specific sites in the pbd DNA
subsequence.

The coding portions of genes to be synthesized are
designed at the protein level and then encoded in DNA. The
ambiguity in the genetic code is exploited to allow optimal
placement of restriction sites, to create various
distributions of amino acids at variegated codons, to
minimize the potential for recombination, and to reduce use
of codons that are poorly translated in the host cell.
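
How the ambiguity of the genetic code can be used to place a restriction site may be illustrated with a short sketch. It enumerates every synonymous DNA encoding of a peptide under the standard genetic code and keeps those that contain a chosen recognition sequence. The peptide GLY-SER and the BamHI site GGATCC are example inputs chosen for this illustration, and the function names are hypothetical; the text does not prescribe this procedure.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}

# Map each amino acid (one-letter code) to its synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    return prod(len(SYNONYMS[aa]) for aa in peptide)


def encodings_with_site(peptide, site):
    """Yield each encoding of the peptide that contains the site."""
    for codons in product(*(SYNONYMS[aa] for aa in peptide)):
        dna = "".join(codons)
        if site in dna:
            yield dna


hits = list(encodings_with_site("GS", "GGATCC"))
print(count_encodings("GS"), hits)
```

The same enumeration can be turned around to choose, among synonymous encodings, one that differs at many positions from a wild-type gene (to discourage recombination) or one that avoids codons poorly used by the host; exhaustive enumeration is only practical for short segments.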

V.F Structural Considerations

The design of the amino-acid sequence for the ipbd-osp
gene to encode involves a number of structural
considerations. The design is somewhat different for each
type of GP. In bacteria, OSPs are not essential, so there
is no requirement that the OSP domain of a fusion have any
of its parental functions beyond lodging in the outer
membrane.

It is desirable that the OSP not constrain the
orientation of the PBD domain; this is not to be confused
with lack of constraint within the PBD. Cwirla et al.
(CWIR90), Scott and Smith (SCOT90), and Devlin et al.
(DEVL90) have taught that variable residues in phage-displayed
random peptides should be free of influence from
the phage OSP. We teach that binding domains having a
moderate to high degree of conformational constraint will
exhibit higher specificity and that higher affinity is also
possible. Thus, we prescribe picking codons for variegation
that specify amino acids that will appear in a well-defined
framework. The nature of the side groups is varied through
a very wide range due to the combinatorial replacement of
multiple amino acids. The main chain conformations of most
PBDs of a given class are very similar. The movement of the
PBD relative to the OSP should not, however, be restricted.
Thus it is often appropriate to include a flexible linker
between the PBD and the OSP. Such flexible linkers can be
taken from naturally occurring proteins known to have
flexible regions. For example, the gIII protein of M13
contains glycine-rich regions thought to allow the amino-terminal
domains a high degree of freedom. Such flexible
linkers may also be designed. Segments of polypeptides that
are rich in the amino acids GLY, ASN, SER, and ASP are
likely to give rise to flexibility. Multiple glycines are
particularly preferred.

When we choose to insert the PBD into a surface loop of
an OSP such as LamB, OmpA, or M13 gIII protein, there are a
few considerations that do not arise when PBD is joined to
the end of an OSP. In these cases, the OSP exerts some
constraining influence on the PBD; the ends of the PBD are
held in more or less fixed positions. We could insert a
highly varied DNA sequence into the osp gene at codons that
encode a surface-exposed loop and select for cells that have
a specific-binding phenotype. When the identified amino-acid
sequence is synthesized (by any means), the constraint
of the OSP is lost and the peptide is likely to have a much
lower affinity for the target and a much lower specificity.
Tan and Kaiser (TANN77) found that a synthetic model of BPTI
containing all the amino acids of BPTI that contact trypsin
has a Kd for trypsin ~10^7 higher than BPTI. Thus, it is
strongly preferred that the varied amino acids be part of a
PBD in which the structural constraints are supplied by the
PBD.

It is known that the amino acids adjoining foreign
epitopes inserted into LamB influence the immunological
properties of these epitopes (VAND90). We expect that PBDs
inserted into loops of LamB, OmpA, or similar OSPs will be
influenced by the amino acids of the loop and by the OSP in
general. To obtain appropriate display of the PBD, it may
be necessary to add one or more linker amino acids between
the OSP and the PBD. Such linkers may be taken from natural
proteins or designed on the basis of our knowledge of the
structural behavior of amino acids. Sequences rich in GLY,
SER, ASN, ASP, ARG, and THR are appropriate. One to five
amino acids at either junction are likely to impart the
desired degree of flexibility between the OSP and the PBD.

A preferred site for insertion of the ipbd gene into
the phage osp gene is one in which: a) the IPBD folds into
its original shape, b) the OSP domains fold into their
original shapes, and c) there is no interference between the
two domains.

If there is a model of the phage that indicates that
either the amino or carboxy terminus of an OSP is exposed to
solvent, then the exposed terminus of that mature OSP
becomes the prime candidate for insertion of the ipbd gene.
A low resolution 3D model suffices.

In the absence of a 3D structure, the amino and carboxy
termini of the mature OSP are the best candidates for
insertion of the ipbd gene. A functional fusion may require
additional residues between the IPBD and OSP domains to
avoid unwanted interactions between the domains. Random-sequence
DNA or DNA coding for a specific sequence of a
protein homologous to the IPBD or OSP can be inserted
between the osp fragment and the ipbd fragment if needed.

Fusion at a domain boundary within the OSP is also a
good approach for obtaining a functional fusion. Smith
exploited such a boundary when subcloning heterologous DNA
into gene III of f1 (SMIT85).

The criteria for identifying OSP domains suitable for
causing display of an IPBD are somewhat different from those
used to identify an IPBD. When identifying an OSP, minimal
size is not so important because the OSP domain will not
appear in the final binding molecule nor will we need to
synthesize the gene repeatedly in each variegation round.
The major design concerns are that: a) the OSP::IPBD fusion
cause display of IPBD, b) the initial genetic construction
be reasonably convenient, and c) the gene be
genetically stable and easily manipulated. There are
several methods of identifying domains. Methods that rely
on atomic coordinates have been reviewed by Janin and
Chothia (JANI85). These methods use matrices of distances
between α carbons (Cα), dividing planes (cf. ROSE85), or
buried surface (RASH84). Chothia and collaborators have
correlated the behavior of many natural proteins with domain
structure (according to their definition). Rashin correctly
predicted the stability of a domain comprising residues 206-316
of thermolysin (VITA84, RASH84).

Many researchers have used partial proteolysis and
protein sequence analysis to isolate and identify stable
domains. (See, for example, VITA84, POTE83, SCOT87a, and
PABO79.) Pabo et al. used calorimetry as an indicator that
the cI repressor from the coliphage λ contains two domains;
they then used partial proteolysis to determine the location
of the domain boundary.

If the only structural information available is the
amino acid sequence of the candidate OSP, we can use the
sequence to predict turns and loops. There is a high
probability that some of the loops and turns will be
correctly predicted (cf. Chou and Fasman (CHOU74)); these
locations are also candidates for insertion of the ipbd gene
fragment.

In bacterial OSPs, the major considerations are: a)
that the PBD is displayed, and b) that the chimeric protein
not be toxic.

From topological models of OSPs, we can determine
whether the amino or carboxy terminus of the OSP is exposed.
If so, then the exposed terminus is an excellent choice for
fusion of the osp fragment to the ipbd fragment.

The lamB gene has been sequenced and is available on a
variety of plasmids (CLEM81, CHAR88a,b). Numerous fusions
of fragments of lamB with a variety of other genes have been
used to study export of proteins in E. coli. From various
studies, Charbit et al. (CHAR88a,b) have proposed a model
that specifies which residues of LamB are: a) embedded in
the membrane, b) facing the periplasm, and c) facing the
cell surface; we adopt the numbering of this model for amino
acids in the mature protein. According to this model,
several loops on the outer surface are defined, including:
1) residues 88 through 111, 2) residues 145 through 165, and
3) residues 236 through 251.

Consider a mini-protein embedded in LamB. For example,
insertion of DNA encoding G1NXCX5XXXCX10SG12 between codons 153
and 154 of lamB is likely to lead to a wide variety of LamB
derivatives being expressed on the surface of E. coli cells.
G1, N2, S11, and G12 are supplied to allow the mini-protein
sufficient orientational freedom that it can interact
optimally with the target. Using affinity enrichment
(involving, for example, FACS via a fluorescently labeled
target, perhaps through several rounds of enrichment), we
might obtain a strain (named, for example, BEST) that
expresses a particular LamB derivative that shows high
affinity for the predetermined target. An octapeptide
having the sequence of the inserted residues 3 through 10
from BEST is likely to have an affinity and specificity
similar to that observed in BEST because the octapeptide has
an internal structure that keeps the amino acids in a
conformation that is quite similar in the LamB derivative
and in the isolated mini-protein.
Fusing one or more new domains to a protein may make
the ability of the new protein to be exported from the cell
different from the ability of the parental protein. The
signal peptide of the wild-type coat protein may function
for authentic polypeptide but be unable to direct export of
a fusion. To utilize the Sec-dependent pathway, one may
need a different signal peptide. Thus, to express and
display a chimeric BPTI/M13 gene VIII protein, we found it
necessary to utilize a heterologous signal peptide (that of

GPs that display peptides having high affinity for the
target may be quite difficult to elute from the target,
particularly a multivalent target. (Bacteria that are bound
very tightly can simply multiply in situ.) For phage, one
can introduce a cleavage site for a specific protease, such
as blood-clotting Factor Xa, into the fusion OSP protein so
that the binding domain can be cleaved from the genetic
package. Such cleavage has the advantage that all resulting
phage have identical OSPs and therefore are equally
infective, even if polypeptide-displaying phage can be
eluted from the affinity matrix without cleavage. This step
allows recovery of valuable genes which might otherwise be
lost. To our knowledge, no one has disclosed or suggested
using a specific protease as a means to recover an
information-containing genetic package or of converting a
population of phage that vary in infectivity into phage
having identical infectivity.
IV.F. Synthesis of Gene Inserts

The present invention is not limited to any particular
method or strategy of DNA synthesis or construction.
Conventional DNA synthesizers may be used, with appropriate
reagent modifications for production of variegated DNA
(similar to that now used for production of mixed probes).
The osp-pbd genes may be created by inserting vgDNA
into an existing parental gene, such as the osp-ipbd shown
to be displayable by a suitably transformed GP. The present
invention is not limited to any particular method of
introducing the vgDNA, e.g., cassette mutagenesis or
single-stranded-oligonucleotide-directed mutagenesis.
IV.G. Operative Cloning Vector

The operative cloning vector (OCV) is a replicable 
nucleic acid used to introduce the chimeric ipbd-osp or
pbd-osp gene into the genetic package. When the genetic
package is a virus, it may serve as its own OCV. For cells 
and spores, the OCV may be a plasmid, a virus, a phagemid, 
or a chromosome. 
IV.H. Transformation of Cells:

When the GP is a cell, the population of GPs is created 
by transforming the cells with suitable OCVs. When the GP 
is a phage, the phage are genetically engineered and then 
transfected into host cells suitable for amplification. 
When the GP is a spore, cells capable of sporulation are 
transformed with the OCV while in a normal metabolic state, 
and then sporulation is induced so as to cause the OSP-PBDs 
to be displayed. The present invention is not limited to 
any one method of transforming cells with DNA. 

The transformed cells are grown first under non- 
selective conditions that allow expression of plasmid genes 
and then selected to kill untransformed cells. Transformed
cells are then induced to express the osp-pbd gene at the
appropriate level of induction. The GPs carrying the IPBD 
or PBDs are then harvested by methods appropriate to the GP 
at hand, generally, centrifugation to pelletize GPs and 
resuspension of the pellets in sterile medium (cells) or 
buffer (spores or phage). They are then ready for 
verification that the display strategy was successful (where 
the GPs all display a "test" IPBD) or for affinity selection 
(where the GPs display a variety of different PBDs) . 
IV.I. Verification of Display Strategy:
The harvested packages are tested to determine whether
the IPBD is present on the surface. In any tests of GPs for
the presence of IPBD on the GP surface, any ions or
cofactors known to be essential for the stability of IPBD or
AfM(IPBD) are included at appropriate levels. The tests can
be done, e.g., a) by affinity labeling, b) enzymatically,
c) spectrophotometrically, d) by affinity separation, or e)
by affinity precipitation. The AfM(IPBD) in this step is
one picked to have strong affinity (preferably,
Kd < 10^-11 M) for the IPBD molecule and little or no affinity
for the wtGP.

V. AFFINITY SELECTION OF TARGET-BINDING MUTANTS

V.A. Affinity Separation Technology, Generally

Affinity separation is used initially in the present
invention to verify that the display system is working,
i.e., that a chimeric outer surface protein has been
expressed and transported to the surface of the genetic
package and is oriented so that the inserted binding domain
is accessible to target material. When used for this
purpose, the binding domain is a known binding domain for a
particular target and that target is the affinity molecule
used in the affinity separation process. For example, a
display system may be validated by inserting DNA
encoding BPTI into a gene encoding an outer surface protein
of the genetic package of interest, and testing for binding
to anhydrotrypsin, which is normally bound by BPTI.

If the genetic packages bind to the target, then we
have confirmation that the corresponding binding domain is
indeed displayed by the genetic package. Packages which
display the binding domain (and thereby bind the target)
are separated from those which do not.

Once the display system is validated, it is possible to
use a variegated population of genetic packages which
display a variety of different potential binding domains,
and use affinity separation technology to determine how well
they bind to one or more targets. This target need not be
one bound by a known binding domain which is parental to the
displayed binding domains, i.e., one may select for binding
to a new target.

The term "affinity separation means" includes, but is
not limited to: a) affinity column chromatography, b) batch
elution from an affinity matrix material, c) batch elution
from an affinity material attached to a plate, d)
fluorescence activated cell sorting, and e) electrophoresis in
the presence of target material. "Affinity material" is used to
mean a material with affinity for the material to be
purified, called the "analyte". In most cases, the
association of the affinity material and the analyte is
reversible so that the analyte can be freed from the
affinity material once the impurities are washed away.
V.B. Affinity Chromatography, Generally

Affinity column chromatography, batch elution from an 
affinity matrix material held in some container, and batch 
elution from a plate are very similar and hereinafter will 
be treated under "affinity chromatography." 

If affinity chromatography is to be used, then: 

1) the molecules of the target material must be of
sufficient size and chemical reactivity to be applied
to a solid support suitable for affinity separation,

2) after application to a matrix, the target material
preferably does not react with water,

3) after application to a matrix, the target material
preferably does not bind or degrade proteins in a
non-specific way, and

4) the molecules of the target material must be
sufficiently large that attaching the material to a matrix
allows enough unaltered surface area (generally at
least 500 Å^2, excluding the atom that is connected to
the linker) for protein binding.

Affinity chromatography is the preferred separation
means, but FACS, electrophoresis, or other means may also be
used.

The present invention makes use of affinity separation
of bacterial cells, or bacterial viruses (or other genetic
packages) to enrich a population for those cells or viruses
carrying genes that code for proteins with desirable binding
properties.

V.C. Target Materials

The present invention may be used to select for binding
domains which bind to one or more target materials, and/or
fail to bind to one or more target materials. Specificity,
of course, is the ability of a binding molecule to bind
strongly to a limited set of target materials, while binding
more weakly or not at all to another set of target materials
from which the first set must be distinguished.

The target materials may be organic macromolecules,
such as polypeptides, lipids, polynucleic acids, and
polysaccharides, but are not so limited. The present
invention is not, however, limited to any of the
above-identified target materials. The only limitation is that
the target material be suitable for affinity separation. 
Thus, almost any molecule that is stable in aqueous solvent 
may be used as a target. 

Serine proteases such as human neutrophil elastase 
(HNE) are an especially interesting class of potential 
target materials. Serine proteases are ubiquitous in living 
organisms and play vital roles in processes such as: 
digestion, blood clotting, fibrinolysis, immune response, 
fertilization, and post-translational processing of peptide 
hormones. Although the role these enzymes play is vital, 
uncontrolled or inappropriate proteolytic activity can be 
very damaging. 

V.D. Immobilization or Labeling of Target Material
For chromatography, FACS, or electrophoresis there may
be a need to covalently link the target material to a second
chemical entity. For chromatography the second entity is a
matrix, for FACS the second entity is a fluorescent dye, and
for electrophoresis the second entity is a strongly charged
molecule. In many cases, no coupling is required because
the target material already has the desired property of: a)
immobility, b) fluorescence, or c) charge. In other cases,
chemical or physical coupling is required.
It is not necessary that the actual target material be
used in preparing the immobilized or labeled analogue that
is to be used in affinity separation; rather, suitable
reactive analogues of the target material may be more
convenient. Target materials that do not have reactive
functional groups may be immobilized by first creating a
reactive functional group through the use of some powerful
reagent, such as a halogen. In some cases, the reactive
groups of the actual target material may occupy a part of
the target molecule that is to be left undisturbed. In that
case, additional functional groups may be introduced by
synthetic chemistry.

Two very general methods of immobilization are widely
used. The first is to biotinylate the compound of interest
and then bind the biotinylated derivative to immobilized
avidin. The second method is to generate antibodies to the
target material, immobilize the antibodies by any of
numerous methods, and then bind the target material to the
immobilized antibodies. Use of antibodies is more
appropriate for larger target materials; small targets
(those comprising, for example, ten or fewer non-hydrogen
atoms) may be so completely engulfed by an antibody that
very little of the target is exposed in the target-antibody
complex.

Non-covalent immobilization of hydrophobic molecules
without resort to antibodies may also be used. A compound,
such as 2,3,3-trimethyldecane, is blended with a matrix
precursor, such as sodium alginate, and the mixture is
extruded into a hardening solution. The resulting beads
will have 2,3,3-trimethyldecane dispersed throughout and
exposed on the surface.

Other immobilization methods depend on the presence of 
particular chemical functionalities. A polypeptide will
present -NH2 (N-terminal; Lysines), -COOH (C-terminal;
Aspartic Acids; Glutamic Acids), -OH (Serines; Threonines;
Tyrosines), and -SH (Cysteines). For the reactivity of
amino acid side chains, see CREI84. A polysaccharide has
free -OH groups, as does DNA, which has a sugar backbone.

Matrices suitable for use as support materials include 
polystyrene, glass, agarose and other chromatographic 
supports, and may be fabricated into beads, sheets, columns, 
wells, and other forms as desired. 

Early in the selection process, relatively high 
concentrations of target materials may be applied to the 
matrix to facilitate binding; target concentrations may 
subsequently be reduced to select for higher affinity SBDs. 
V.E. Elution of Lower Affinity PBD-Bearing Genetic Packages
The population of GPs is applied to an affinity matrix
under conditions compatible with the intended use of the
binding protein and the population is fractionated by
passage of a gradient of some solute over the column. The
process enriches for PBDs having affinity for the target and
for which the affinity for the target is least affected by
the eluants used. The enriched fractions are those
containing viable GPs that elute from the column at greater
concentration of the eluant.

The eluants preferably are capable of weakening
noncovalent interactions between the displayed PBDs and the
immobilized target material. Preferably, the eluants do not
kill the genetic package; the genetic message corresponding
to successful mini-proteins is most conveniently amplified
by reproducing the genetic package rather than by in vitro
procedures such as PCR. The list of potential eluants
includes salts (including Na+, NH4+, Rb+, SO4--, H2PO4-,
citrate, K+, Li+, Cs+, HSO4-, CO3--, Ca++, Sr++, Cl-, PO4---,
HCO3-, Mg++, Ba++, Br-, HPO4--, and acetate), acid, heat,
compounds known to bind the target, and soluble target material
(or analogues thereof).

The uneluted genetic packages contain DNA encoding
binding domains which have a sufficiently high affinity for
the target material to resist the elution conditions. The
DNA encoding such successful binding domains may be
recovered in a variety of ways. Preferably, the bound
genetic packages are simply eluted by means of a change in
the elution conditions. Alternatively, one may culture the
genetic package in situ, or extract the target-containing
matrix with phenol (or other suitable solvent) and amplify
the DNA by PCR or by recombinant DNA techniques.
Additionally, if a site for a specific protease has been
engineered into the display vector, the specific protease is
used to cleave the binding domain from the GP.

Nonspecific binding to the matrix, etc., may be 
identified or reduced by techniques well known in the 
affinity separation art. 
V.F. Recovery of Packages:

Recovery of packages that display binding to an 
affinity column may be achieved in several ways, including: 

1) collect fractions eluted from the column with a
gradient as described above; fractions eluting later
in the gradient contain GPs more enriched for genes
encoding PBDs with high affinity for the column,

2) elute the column with the target material in soluble
form,

3) flood the matrix with a nutritive medium and grow the
desired packages in situ,

4) remove parts of the matrix and use them to inoculate
growth medium,

5) chemically or enzymatically degrade the linkage
holding the target to the matrix so that GPs still
bound to target are eluted, or

6) degrade the packages and recover DNA with phenol or
other suitable solvent; the recovered DNA is used to
transform cells that regenerate GPs.

It is possible to utilize combinations of these methods. It
should be remembered that what we want to recover from the
affinity matrix is not the GPs per se, but the information
in them. Recovery of viable GPs is very strongly preferred,
but recovery of genetic material is essential. If cells,
spores, or virions bind irreversibly to the matrix but are
not killed, we can recover the information through in situ
cell division, germination, or infection respectively.
Proteolytic degradation of the packages and recovery of DNA
is not preferred.

V.G. Amplifying the Enriched Packages
Viable GPs having the selected binding trait are
amplified by culture in a suitable medium, or, in the case
of phage, infection into a host so cultivated. If the GPs
have been inactivated by the chromatography, the OCV
carrying the osp-pbd gene is recovered from the GP, and
introduced into a new, viable host.
V.H. Characterizing the Putative SBDs:

For one or more clonal isolates, we may subclone the
sbd gene fragment, without the osp fragment, into an
expression vector such that each SBD can be produced as a free
protein. Physical measurements of the strength of binding
may be made for each free SBD protein by any suitable
method.

If we find that the binding is not yet sufficient, we
decide which residues of the SBD (now a new PPBD) to vary
next. If the binding is sufficient, then we now have an
expression vector bearing a gene encoding the desired novel
binding protein.

V.I. Joint Selections:

One may modify the affinity separation of the method
described to select a molecule that binds to material A but
not to material B, or that binds to both A and B, either
alternatively or simultaneously.
V.J. Engineering of Antagonists

It may be desirable to provide an antagonist to an
enzyme or receptor. This may be achieved by making a
molecule that prevents the natural substrate or agonist from
reaching the active site. Molecules that bind directly to
the active site may be either agonists or antagonists. Thus
we adopt the following strategy. We consider enzymes and
receptors together under the designation TER (Target Enzyme
or Receptor).

For most TERs, there exist chemical inhibitors that
block the active site. Usually, these chemicals are useful
only as research tools due to high toxicity. We make two
affinity matrices: one with active TER and one with blocked
TER. We make a variegated population of GP(PBD)s and select
for SBDs that bind to both forms of the enzyme, thereby
obtaining SBDs that do not bind to the active site. We
expect that SBDs will be found that bind different places on
the enzyme surface. Pairs of the sbd genes are fused with
an intervening peptide segment. For example, if SBD-1 and
SBD-2 are binding domains that show high affinity for the
target enzyme and for which the binding is non-competitive,
then the gene sbd-1::linker::sbd-2 encodes a two-domain
protein that will show high affinity for the target. We
make several fusions having a variety of SBDs and various
linkers. Such compounds have a reasonable probability of
being an antagonist to the target enzyme.

VI. EXPLOITATION OF SUCCESSFUL BINDING DOMAINS AND
CORRESPONDING DNAS

While the SBD may be produced by recombinant DNA
techniques, an advantage inhering from the use of a
mini-protein as an IPBD is that it is likely that the derived SBD
will also behave like a mini-protein and will be obtainable
by means of chemical synthesis. (The term "chemical
synthesis", as used herein, includes the use of enzymatic
agents in a cell-free environment.)

It is also to be understood that mini-proteins obtained
by the method of the present invention may be taken as lead
compounds for a series of homologues that contain
non-naturally occurring amino acids and groups other than amino
acids. For example, one could synthesize a series of
homologues in which each member of the series has one amino
acid replaced by its D enantiomer. One could also make
homologues containing constituents such as β-alanine,
aminobutyric acid, 3-hydroxyproline, 2-aminoadipic acid,
ethylasparagine, norvaline, etc.; these would be tested for
binding and other properties of interest, such as stability
and toxicity.
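
The D-enantiomer series described above can be enumerated mechanically. The following is a minimal sketch, not part of the original disclosure: it assumes one-letter amino acid codes, uses the common convention of writing a D residue in lower case, and skips glycine because it is achiral. The function name d_enantiomer_scan and the example sequence are illustrative only.

```python
def d_enantiomer_scan(sequence):
    """Return one homologue per chiral residue, with that residue
    written in lower case to mark the D enantiomer (a notational
    convention assumed here, not taken from the text)."""
    homologues = []
    for i, residue in enumerate(sequence):
        if residue == "G":
            # Glycine has no chiral alpha carbon, so no D form exists.
            continue
        homologues.append(sequence[:i] + residue.lower() + sequence[i + 1:])
    return homologues


if __name__ == "__main__":
    # Hypothetical lead sequence used only to show the output shape.
    for homologue in d_enantiomer_scan("CHPQFC"):
        print(homologue)
```

Each member of the returned list would then be synthesized and tested separately for binding, stability, and toxicity.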

Peptides may be chemically synthesized either in
solution or on supports. Various combinations of stepwise
synthesis and fragment condensation may be employed.

During synthesis, the amino acid side chains are
protected to prevent branching. Several different
protective groups are useful for the protection of the thiol
groups of cysteines:

1) 4-methoxybenzyl (MBzl; Mob) (NISH82; ZAFA88), removable
with HF;

2) acetamidomethyl (Acm) (NISH82; NISH86; BECK89c),
removable with iodine; mercury ions (e.g., mercuric
acetate); silver nitrate; and

3) S-para-methoxybenzyl (HOUG84).

Other thiol protective groups may be found in standard
reference works such as Greene, PROTECTIVE GROUPS IN ORGANIC
SYNTHESIS (1981).

Once the polypeptide chain has been synthesized,
disulfide bonds must be formed. Possible oxidizing agents
include air (HOUG84; NISH86), ferricyanide (NISH82; HOUG84),
iodine (NISH82), and performic acid (HOUG84). Temperature,
pH, solvent, and chaotropic chemicals may affect the course
of the oxidation.

A large number of micro-proteins with a plurality of
disulfide bonds have been chemically synthesized in
biologically active form: conotoxin GI (13 AA, 4 Cys)
(NISH82); heat-stable enterotoxin ST (18 AA, 6 Cys) (HOUG84);
analogues of ST (BHAT86); ω-conotoxin GVIA (27 AA, 6 Cys)
(NISH86; RIVI87b); ω-conotoxin MVIIA (27 AA, 6 Cys) (OLIV87b);
α-conotoxin SI (13 AA, 4 Cys) (ZAFA88); μ-conotoxin IIIA
(22 AA, 6 Cys) (BECK89c, CRUZ89, HATA90). Sometimes, the
polypeptide naturally folds so that the correct disulfide
bonds are formed. Other times, it must be helped along by
use of a differently removable protective group for each
pair of cysteines.
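
The need for pair-specific protection follows from simple counting: a chain with 2n cysteines can be fully paired in (2n-1)!! ways, only one of which is the native pattern. The sketch below is an illustration, not taken from the source; it enumerates the pairings under the assumption that every cysteine is oxidized and that only intramolecular bonds form. The function name disulfide_pairings is hypothetical.

```python
def disulfide_pairings(cysteines):
    """Enumerate every way to join the given cysteines into
    disulfide pairs (all cysteines assumed to be paired)."""
    if not cysteines:
        return [[]]
    first, rest = cysteines[0], cysteines[1:]
    pairings = []
    for i, partner in enumerate(rest):
        # Pair the first cysteine with each possible partner,
        # then pair up whatever remains.
        remaining = rest[:i] + rest[i + 1:]
        for tail in disulfide_pairings(remaining):
            pairings.append([(first, partner)] + tail)
    return pairings


if __name__ == "__main__":
    for n_cys in (4, 6):
        labels = [f"Cys{i}" for i in range(1, n_cys + 1)]
        print(n_cys, "Cys:", len(disulfide_pairings(labels)), "pairings")
```

For the 4-Cys conotoxins there are 3 possible pairings and for the 6-Cys peptides there are 15, which is why random oxidation can give mixtures when the chain does not fold spontaneously into the native pattern.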

The successful binding domains of the present invention
may, alone or as part of a larger protein, be used for any
purpose for which binding proteins are suited, including
isolation or detection of target materials. In furtherance
of this purpose, the novel binding proteins may be coupled
directly or indirectly, covalently or noncovalently, to a
label, carrier or support.

When used as a pharmaceutical, the novel binding
proteins may be combined with suitable carriers or
adjuvants.

EXAMPLE I

DESIGN AND MUTAGENESIS OF A CLASS 1 MICRO-PROTEIN

To obtain a library of binding domains that are
conformationally constrained by a single disulfide, we
insert DNA coding for the following family of micro-proteins
into the gene coding for a suitable OSP:

     X1-C-X2-X3-X4-X5-C-X6
        |______________|

where |___| indicates disulfide bonding. Disulfides
normally do not form between cysteines that are consecutive
on the polypeptide chain. One or more of the residues
indicated above as Xn will be varied extensively to obtain
novel binding. There may be one or more amino acids that
precede X1 or follow X6; however, the residues before X1 or
after X6 will not be significantly constrained by the
diagrammed disulfide bridge, and it is less advantageous to
vary these remote, unbridged residues. The last X residue
is connected to the OSP of the genetic package.

X1, X2, X3, X4, X5, and X6 can be varied independently;
i.e., a different scheme of variegation could be used at each
position. X1 and X6 are the least constrained residues and
may be varied less than other positions.

X1 and X6 can be, for example, one of the amino acids
[E, K, T, and A]; this set of amino acids is preferred
because: a) the possibility of positively charged,
negatively charged, and neutral amino acids is provided, b) these
amino acids can be provided in 1:1:1:1 ratio via the codon
RMG (R = equimolar A and G, M = equimolar A and C), and c)
these amino acids allow proper processing by signal
peptidases.
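
A degenerate codon can be checked by expanding it into its member codons and translating each one. The sketch below is illustrative and assumes the standard genetic code and the IUPAC ambiguity codes R = A/G and M = A/C; the helper names expand and encoded are hypothetical. It shows that a codon of the form R, M, then G expands to exactly four codons that encode K, T, E, and A once each.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC ambiguity codes needed for this example only.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "M": "AC"}


def expand(degenerate):
    """List every concrete codon covered by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in degenerate))]


def encoded(degenerate):
    """Map each concrete codon to the amino acid it encodes."""
    return {codon: CODON_TABLE[codon] for codon in expand(degenerate)}


if __name__ == "__main__":
    print(encoded("RMG"))
```

Because each of the four codons encodes a different amino acid, equimolar base mixtures at the first two positions give the 1:1:1:1 ratio mentioned above.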

In a preferred embodiment, X2, X3, X4 and X5 are
initially variegated by encoding each by the codon NNT,
which encodes the substitution set [F, S, Y, C, L, P, H, R,
I, T, N, V, A, D, and G].

The advantages of the NNT over the NNK codon become
increasingly apparent as the number of variegated codons
increases. Tables 10 and 130 compare libraries in which six
codons have been varied either by NNT or NNK codons. NNT
encodes 15 different amino acids and only 16 DNA sequences.
Thus, there are 1.139 x 10^7 amino-acid sequences, no stops,
and only 1.678 x 10^7 DNA sequences. A library of 10^8
independent transformants will contain 99% of all possible
sequences. The NNK library contains 6.4 x 10^7 amino-acid
sequences, but complete sampling requires a much larger number
of independent transformants.
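
The library sizes quoted above can be reproduced by enumerating the codons. The sketch below is an illustration under stated assumptions: the standard genetic code, IUPAC codes N = A/C/G/T and K = G/T, equal representation of every DNA sequence, and a Poisson approximation for sampling. The function names library_size and expected_coverage and the figure of 10^8 transformants used in the example are assumptions made for this sketch.

```python
from itertools import product
from math import exp

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "K": "GT"}


def expand(degenerate):
    """List every concrete codon covered by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in degenerate))]


def library_size(degenerate, n_codons):
    """Count DNA and amino-acid sequences for n identical varied codons."""
    codons = expand(degenerate)
    amino_acids = {CODON_TABLE[c] for c in codons} - {"*"}
    return {
        "dna": len(codons) ** n_codons,
        "protein": len(amino_acids) ** n_codons,
        "stop_codons": sum(CODON_TABLE[c] == "*" for c in codons),
    }


def expected_coverage(n_transformants, n_dna_sequences):
    """Expected fraction of DNA sequences seen at least once,
    assuming all sequences are equally likely (Poisson sampling)."""
    return 1.0 - exp(-n_transformants / n_dna_sequences)


if __name__ == "__main__":
    for scheme in ("NNT", "NNK"):
        print(scheme, library_size(scheme, 6))
    print(expected_coverage(10**8, 16**6))
```

Under these assumptions six NNT codons give 16^6 DNA sequences and 15^6 amino-acid sequences with no stop codons, while six NNK codons give 32^6 DNA sequences for 20^6 amino-acid sequences, so far more transformants are needed to sample the NNK library to the same depth.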

This sequence can be displayed as a fusion to the gene
III protein of M13 using the native M13 gene III promoter
and signal sequence. The sequence of M13 gene III protein,
from residue 16 to 23, is S16HSAETVE23; signal peptidase-I
cleaves after S18. We replace this segment with
S16GA18AEGX1CX2X3X4X5CX6SYIEGRVIETVE.
Note that changing H17S18 to GA does not impair the phage for
infectivity. It is useful to insert a bovine F.Xa
recognition/cleavage site (YIEGR/VI) between the PBD and the
mature III protein; this not only allows orientational
freedom for the PBD, but also allows cleavage of the PBD
from the GP.

A phage library in which X1, X2, X5, and X6 are encoded
by NNT (allowing F, S, Y, C, L, P, H, R, I, T, N, V, A, D,
and G) and in which X3 and X4 are encoded by NNG (allowing L,
S, W, P, Q, R, M, T, K, V, A, E, and G) is named TN2. This
library displays about 8.55 x 10^6 micro-proteins encoded by
about 1.5 x 10^7 DNA sequences. NNG is used at the third and
fourth variable positions (the central positions of the
disulfide-closed loop) at least in part to avoid the
possibility of cysteines at these positions.
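
The TN2 figures can be checked by multiplying per-position counts. The sketch below is illustrative; it assumes the standard genetic code and the scheme NNT, NNT, NNG, NNG, NNT, NNT for X1 through X6, and the names count_library and TN2_SCHEME are hypothetical. It counts distinct amino acids and stop-free codons at each position.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}

# Variegation scheme for X1..X6 as described in the text.
TN2_SCHEME = ["NNT", "NNT", "NNG", "NNG", "NNT", "NNT"]


def expand(degenerate):
    """List every concrete codon covered by a degenerate codon."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in degenerate))]


def count_library(scheme):
    """Return (amino-acid sequences, stop-free DNA sequences)."""
    protein = 1
    dna_without_stops = 1
    for degenerate in scheme:
        coding = [c for c in expand(degenerate) if CODON_TABLE[c] != "*"]
        protein *= len({CODON_TABLE[c] for c in coding})
        dna_without_stops *= len(coding)
    return protein, dna_without_stops


if __name__ == "__main__":
    print(count_library(TN2_SCHEME))
    print(sorted({CODON_TABLE[c] for c in expand("NNG")}))
```

The count gives 15^4 x 13^2 = 8,555,625 micro-proteins. Of the 16 NNG codons one (TAG) is a stop, so the stop-free DNA count is 16^4 x 15^2 = 14,745,600, which appears to be the basis of the "about 1.5 x 10^7" figure; no NNG codon encodes cysteine.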
Devlin, et al., screened 10^7 transformants, each of
which could display one of I0 a random pentadecapeptides , for 
affinity with streptavidin, and found 20
streptavidin-binding phage isolates, with eight unique sequences
("A"-"I"). All contained HP; 15/20, HPQ; and 6/20, HPQF, though
in different positions within the pentadecapeptide. The
most frequently encountered isolates were D(5), I(4), and
A(3), which entirely lacked cysteines. However, two
positive isolates, "E"(1) and "F"(2), included a pair of
cysteines positioned so that formation of a disulfide bond
was possible. The sequences of these isolates are given in
Table 820.

We recognized that our TN2 library should include a
putative micro-protein, HPQ, similar enough to Devlin's "E"
and "F" peptides to have the potential of exhibiting
streptavidin-binding activity. HPQ comprises the AEG amino
terminal sequence common to all members of the TN2 library,
followed by the sequence PCHPQFCQ which has the potential
for forming a disulfide bridge with a span of four, followed
by a serine (S) and a bovine factor Xa recognition site
(YIEGR/VI) (see Table 820). Pilot experiments showed that
the binding of HPQ-bearing phage to streptavidin was
comparable to that of Devlin's "F" isolate; both were
marginally above background (1.7x). We therefore screened
our TN2 library against immobilized streptavidin.

Streptavidin is available as free protein (Pierce) with
a specific activity of 14.6 units per mg (1 unit will bind
1 µg of biotin). A stock solution of 1 mg per mL in PBS
containing 0.01% azide is made. 100 µL of StrAv stock is
added to each 250 µL capacity well of Immulon (#4) plates
and incubated overnight at 4°C. The stock is removed and
replaced with 250 µL of PBS containing BSA at a
concentration of 1 mg/mL and left at 4°C for a further 1
hour. Prior to use in a phage binding assay the wells are
washed rapidly 5 times with 250 µL of PBS containing 0.1%
Tween.

To each StrAv-coated well is added 100 µL of binding
buffer (PBS with 1 mg per mL BSA) containing a known
5 quantity of phage (10» pfu's of the TN2 library), 
incubation proceeds for 1 hr at room temperature followed by
removal of the non-bound phage and 10 rapid washes with PBS
containing 0.1% Tween; the wells are then further washed with
citrate buffers of pH 7, 6 and 5 to remove non-specifically
bound phage. The bound phage
are eluted with 250 µL of pH 2 citrate buffer containing 1 mg
per mL BSA and neutralized with 60 µL of 1 M Tris, pH 8.
The eluate was used to infect bacterial cells which
generated a new phage stock to be used for a further round
of binding, washing and elution. The enhancement cycles
were repeated two more times (three in total) after which
time a number of individual phage were sequenced and tested
as clonal isolates. The number of phage present in each
step is determined as plaque forming units (pfu's) following
appropriate dilutions and plating in a lawn of F'-containing
E. coli.
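
The plaque counts obtained at each step can be converted to titers and to a recovered fraction as sketched below. This is a generic titering calculation offered as an illustration, not a procedure stated in the source; the function names and all numbers in the example are hypothetical.

```python
def titer_pfu_per_ml(plaques, dilution, plated_volume_ml):
    """Titer of the undiluted stock from one countable plate.

    dilution is the overall dilution factor of the plated sample
    (for example 1e-6 for a million-fold dilution)."""
    return plaques / (dilution * plated_volume_ml)


def recovered_fraction(input_pfu, eluted_pfu):
    """Fraction of the phage applied to a well that is recovered
    in the eluate of one binding round."""
    return eluted_pfu / input_pfu


if __name__ == "__main__":
    # Hypothetical plate: 42 plaques from 0.1 mL of a 1e-6 dilution.
    stock_titer = titer_pfu_per_ml(42, 1e-6, 0.1)
    print(f"{stock_titer:.2e} pfu/mL")
    # Hypothetical round: 1e10 pfu applied, 2e5 pfu eluted.
    print(f"{recovered_fraction(1e10, 2e5):.1e} of input recovered")
```

Comparing the recovered fraction from round to round gives a simple measure of whether the enhancement cycles are enriching for binders.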

Table 838 shows the peptide sequences found to bind to 
StrAv and their frequency in the random picks taken from the 
final (round 3) phage pool. 

The intercysteine segment of all of the putative
micro-proteins examined contained the HPQF motif. The variable
residue before the first cysteine could have contained any
of {F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G}; the residues selected
were {Y,H,L,D,N} while phage HPQ has P. The variable
residue after the second cysteine also could have had
{F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G}; the residues selected were
{P,S,G,R,V} while phage HPQ has Q. The relatively poor
binding of phage HPQ could be due to P4 or to Q11 or both.
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in a control experiment, the TN2 library was screened 
in an identical manner to that shown above but with the 
target protein being the blocking agent BSA. Following 
three rounds of binding, elution, and amplification, sixteen 
random phage plaques were picked and sequenced. Half of the 
clones demonstrated a lack of insert (8/16) , the other half 
had the sequences shown in Table 839. There is no consensus 

for this collection. 

We have displayed a related micro-protein, HPQ6, on
phage. It is identical to HPQ except for the replacement of
CHPQFC with CHPQFPRC (see Table 820). When displayed, HPQ6
had a substantially stronger affinity for streptavidin than
either HPQ or Devlin's "F" isolate. (Devlin's "E" isolate
was not studied.) Treatment with dithiothreitol (DTT)
markedly reduced the binding of HPQ6 phage (but not control
phage) to streptavidin, suggesting that the presence of a
disulfide bridge within the displayed peptide was required
for good binding. In view of the results of the screening
of the TN2 library, it is likely that the binding of phage
HPQ6 could be further improved by changing P4 to one of
{Y,H,L,D,N} and/or changing Q13 to one of {P,S,G,R,V}.



EXAMPLE II 
A CYS::HELIX::TURN::STRAND::CYS UNIT

The parental Class 2 micro-protein may be a naturally-
occurring Class 2 micro-protein. It may also be a domain of
a larger protein whose structure satisfies or may be
modified so as to satisfy the criteria of a class 2 micro-
protein. The modification may be a simple one, such as the
introduction of a cysteine (or a pair of cysteines) into the
base of a hairpin structure so that the hairpin may be
closed off with a disulfide bond, or a more elaborate one,
such as the modification of intermediate residues so as to
achieve the hairpin structure. The parental class 2 micro-
protein may also be a composite of structures from two or
more naturally-occurring proteins, e.g., an α helix of one
protein and a β strand of a second protein.

WO 92/15677                                   PCT/US92/01456

One micro-protein motif of potential use comprises a
disulfide loop enclosing a helix, a turn, and a return
strand. Such a structure could be designed or it could be
obtained from a protein of known 3D structure. Scorpion
neurotoxin, variant 3, (ALMA83a, ALMA83b) (hereafter
ScorpTx) contains a structure diagrammed in Figure 1 that
comprises a helix (residues N22 through N33), a turn
(residues 33 through 35), and a return strand (residues 36
through 41). ScorpTx contains disulfides that join residues
12-65, 16-41, 25-46, and 29-48. CYS25 and CYS41 are quite
close and could be joined by a disulfide without deranging
the main chain. Figure 1 shows CYS25 joined to CYS41. In
addition, CYS29 has been changed to GLN. It is expected that
a disulfide will form between 25 and 41 and that the helix
shown will form; we know that the amino-acid sequence shown
is highly compatible with this structure. The presence of
GLY35, GLY36, and GLY39 gives the turn and extended strand
sufficient flexibility to accommodate any changes needed
around CYS41 to form the disulfide.

From examination of this structure (as found in entry
1SN3 of the Brookhaven Protein Data Bank), we see that the
following sets of residues would be preferred for
variegation:
SET 1

     Residue   Codon   Allowed amino acids   Naa/Ndna

 1)  T27       NNG     L²R²MVSPTAQKEWG.      13/15
 2)  E28       VHG     LMVPTAQKE              9/9
 3)  A31       VHG     LMVPTAQKE              9/9
 4)  K32       VHG     LMVPTAQKE              9/9
 5)  G24       NNG     L²R²MVSPTAQKEWG.      13/15
 6)  E23       VHG     LMVPTAQKE              9/9
 7)  Q34       VAS     HQNKED                 6/6

Note: Exponents on amino acids indicate multiplicity of
codons.

Positions 27, 28, 31, 32, 24, and 23 comprise one face
of the helix. At each of these locations we have picked a
variegating codon that a) includes the parental amino acid,
b) includes a set of residues having a predominance of
helix-favoring residues, c) provides for a wide variety of
amino acids, and d) leads to as even a distribution as
possible. Position 34 is part of a turn. The side group of
residue 34 could interact with molecules that contact the
side groups of residues 27, 28, 31, 32, 24, and 23. Thus we
allow variegation here and provide amino acids that are
compatible with turns. The variegation shown leads to
6.65 × 10⁶ amino-acid sequences encoded by 8.85 × 10⁶ DNA
sequences.
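The Naa/Ndna bookkeeping for SET 1 can be checked by expanding each variegated codon against the standard genetic code. The following is a minimal sketch, not the procedure used to prepare the table: the helper names (`expand`, `library_size`) are illustrative, and it assumes the usual IUPAC degenerate-base codes and that stop codons are excluded from both counts, as in the 13/15 entry for NNG.

```python
from itertools import product

# IUPAC degenerate nucleotide codes (only those needed for common vgCodons).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Standard genetic code, bases ordered T, C, A, G at each position; "*" = stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(vg_codon):
    """Map each encoded residue ("*" for stop) to its number of codons."""
    counts = {}
    for triplet in product(*(IUPAC[base] for base in vg_codon)):
        residue = CODON_TABLE["".join(triplet)]
        counts[residue] = counts.get(residue, 0) + 1
    return counts


def library_size(vg_codons):
    """Return (amino-acid sequences, stop-free DNA sequences) for a design."""
    n_aa = n_dna = 1
    for codon in vg_codons:
        counts = expand(codon)
        counts.pop("*", None)           # stop codons are not counted
        n_aa *= len(counts)
        n_dna *= sum(counts.values())
    return n_aa, n_dna


SET1 = ["NNG", "VHG", "VHG", "VHG", "NNG", "VHG", "VAS"]
n_aa, n_dna = library_size(SET1)
print(f"{n_aa:.3e} amino-acid sequences, {n_dna:.3e} DNA sequences")
```

Expanding VHG in this way gives K, T, M, Q, P, L, E, A, and V, which is why the allowed set for VHG contains Q and cannot contain G.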

SET 2

     Residue   Codon   Allowed amino acids   Naa/Ndna

 1)  D26       VHS     L²IMV²P²T²A²HQNKDE    13/18
 2)  T27       NNG     L²R²MVSPTAQKEWG.      13/15
 3)  K30       VHG     KEQPTALMV              9/9
 4)  A31       VHG     KEQPTALMV              9/9
 5)  K32       VHG     LMVPTAQKE              9/9
 6)  S37       RRT     SNDG                   4/4
 7)  Y38       NHT     YSFHPLNTIDAV          12/12

Positions 26, 27, 30, 31, and 32 are variegated so as
to enhance helix-favoring amino acids in the population.
Residues 37 and 38 are in the return strand, so we pick
different variegation codons. This variegation allows
5.91 × 10⁶ amino-acid sequences and 9.45 × 10⁶ DNA sequences.
Thus a library that embodies this scheme can be sampled very
efficiently.

EXAMPLE III

DESIGN AND MUTAGENESIS OF CLASS 3 MICRO-PROTEIN

Two Disulfide Bond Parental Micro-Proteins

Micro-proteins with two disulfide bonds may be modelled
after the α-conotoxins, e.g., GI, GIA, GII, MI, and SI.
These have the following conserved structure:

          1 2         1'        2'
(1-2 AAs)-C-C-(3 AAs)-C-(5 AAs)-C-(0-5 AAs)

Hashimoto et al. (HASH85) reported synthesis of twenty-
four analogues of α-conotoxins GI, GII, and MI. Using the
numbering scheme for GI (CYS at positions 2, 3, 7, and 13),
Hashimoto et al. reported alterations at 4, 8, 10, and 12
that allow the proteins to be toxic. Almquist et al.
(ALMQ89) synthesized [des-GLU1] α-conotoxin GI and twenty
analogues. They found that substituting GLY for PRO5 gave
rise to two isomers, perhaps related to different disulfide
bonding. They found a number of substitutions at residues
8 through 11 that allowed the protein to be toxic. Zafaralla
et al. (ZAFA88) found that substituting PRO at position
9 gives an active protein. Each of the groups cited used
only in vivo toxicity as an assay for the activity. From
such studies, one can infer that an active protein has the
parental 3D structure, but one cannot infer that an
inactive protein lacks the parental 3D structure.

Pardi et al. (PARD89) determined the 3D structure of
α-conotoxin GI obtained from venom by NMR. Kobayashi et al.
(KOBA89) have reported a 3D structure of synthetic
α-conotoxin GI from NMR data which agrees with that of PARD89.
We refer to Figure 5 of Pardi et al.

Residue GLU1 is known to accommodate GLU, ARG, and ILE
in known analogues or homologues. A preferred variegation
codon is NNG, which allows the set of amino acids
[L²R²MVSPTAQKEWG<stop>]. From Figure 5 of Pardi et al. we
see that the side group of GLU1 projects into the same
region as the strand comprising residues 9 through 12.
Residues 2 and 3 are cysteines and are not to be varied. The
side group of residue 4 points away from residues 9 through
12; thus we defer varying this residue until a later round.
PRO5 may be needed to cause the correct disulfides to form;
when GLY was substituted here the peptide folded into two
forms, neither of which is toxic. It is allowed to vary
PRO5, but not preferred in the first round.

No substitutions at ALA6 have been reported. A
preferred variegation codon is RMG, which gives rise to ALA,
THR, LYS, and GLU (small hydrophobic, small hydrophilic,
positive, and negative). CYS7 is not varied. We prefer to
leave GLY8 as is, although a homologous protein having ALA8
is toxic. Homologous proteins having various amino acids at
position 9 are toxic; thus, we use an NNT variegation codon,
which allows FS²YCLPHRITNVADG. We use NNT at positions 10,
11, and 12 as well. At position 14, following the fourth
CYS, we allow ALA, THR, LYS, or GLU (via an RMG codon).
This variegation allows 1.053 × 10⁷ amino-acid sequences,
encoded by 1.68 × 10⁷ DNA sequences. Libraries having
2.0 × 10⁷, 3.0 × 10⁷, and 5.0 × 10⁷ independent transformants
will, respectively, display ~70%, ~83%, and ~95% of the
allowed sequences. Other variegations are also appropriate.
Concerning α-conotoxins, see, inter alia, ALMQ89, CRUZ85,
GRAY83, GRAY84, and PARD89.
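The percentages quoted for the α-conotoxin library are consistent with a simple Poisson sampling model. This is a sketch under an assumption that the text does not state explicitly: every transformant is treated as an independent, equally likely draw from the 1.68 × 10⁷ DNA sequences. The function name is illustrative.

```python
import math


def fraction_displayed(transformants, dna_sequences):
    """Expected fraction of the allowed DNA sequences present at least once.

    Assumes independent, uniform sampling, so the chance that a given
    sequence is missed by every transformant is exp(-transformants / M).
    """
    return 1.0 - math.exp(-transformants / dna_sequences)


DNA_SEQUENCES = 1.68e7  # value quoted in the text for this variegation
for n in (2.0e7, 3.0e7, 5.0e7):
    pct = 100.0 * fraction_displayed(n, DNA_SEQUENCES)
    print(f"{n:.1e} transformants -> about {pct:.0f}% of sequences")
```

Under this model the three library sizes come out close to the ~70%, ~83%, and ~95% figures given above. Unequal nucleotide incorporation or biased transformation would lower the real coverage.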

The parental micro-protein may instead be one of the
proteins designated "Hybrid-I" and "Hybrid-II" by Pease et
al. (PEAS90); cf. Figure 4 of PEAS90. One preferred set of
residues to vary for either protein consists of:

Parental     Variegated   Allowed             AA seqs/
amino acid   codon        amino acids         DNA seqs

A5           RVT          ADGTNS               6/6
P6           VYT          PTALIV               6/6
E7           RRS          EDNKSRG²             7/8
T8           VHG          TPALMVQKE            9/9
A9           VHG          ATPLMVQKE            9/9
A10          RMG          AEKT                 4/4
K12          VHG          KQETPALMV            9/9
Q16          NNG          L²R²S.WPQMTKVAEG    13/15

This provides 9.55 × 10⁶ amino-acid sequences encoded by
1.26 × 10⁷ DNA sequences. A library comprising 5.0 × 10⁷
transformants allows expression of 98.2% of all possible
sequences. At each position, the parental amino acid is
allowed.

At position 5 we provide amino acids that are compatible
with a turn. At position 6 we allow ILE and VAL because
they have branched β carbons and make the chain rigid. At
position 7 we allow ASP, ASN, and SER, which often appear at
the amino termini of helices. At positions 8 and 9 we allow
several helix-favoring amino acids (ALA, LEU, MET, GLN, GLU,
and LYS) that have differing charges and hydrophobicities
because these are part of the helix proper. Position 10 is
further around the edge of the helix, so we allow a smaller
set (ALA, THR, LYS, and GLU). This set not only includes 3
helix-favoring amino acids plus THR, which is well tolerated,
but also allows positive, negative, and neutral hydrophilic
residues. The side groups of 12 and 16 project into the same
region as the residues already recited. At these positions
we allow a wide variety of amino acids with a bias toward
helix-favoring amino acids.
The parental micro-protein may instead be a polypeptide
composed of residues 9-24 and 31-40 of aprotinin and
possessing two disulfides (Cys9-Cys22 and Cys14-Cys38).
Such a polypeptide would have the same disulfide bond
topology as α-conotoxin, and its two bridges would have
spans of 12 and 17, respectively.

Residues 23, 24 and 31 are variegated to encode the
amino acid residue set [G,S,R,D,N,H,P,T,A] so that a
sequence that favors a turn of the necessary geometry is
found. We use trypsin or anhydrotrypsin as the affinity
molecule to enrich for GPs that display a micro-protein that
folds into a stable structure similar to BPTI in the P1
region.

Three Disulfide Bond Parental Micro-Proteins

The cone snails (Conus) produce venoms (conotoxins)
which are 10-30 amino acids in length and exceptionally rich
in disulfide bonds. They are therefore archetypal micro-
proteins. Novel micro-proteins with three disulfide bonds
may be modelled after the µ-(GIIIA, GIIIB, GIIIC) or
ω-(GVIA, GVIB, GVIC, GVIIA, GVIIB, MVIIA, MVIIB, etc.)
conotoxins. The µ-conotoxins have the following conserved
structure:

        1 2         3         1'        2'3'
(2 AAs)-C-C-(5 AAs)-C-(4 AAs)-C-(4 AAs)-C-C-AA

No 3D structure of a µ-conotoxin has been published.
Hidaka et al. (HIDA90) have established the connectivity of
the disulfides. The following diagram depicts geographu-
toxin I (also known as µ-conotoxin GIIIA).



R1-D2-C3-C4-T5-P6-P7-K8-K9-C10-K11-D12-R13-Q14-C15-K16-P17-Q18-R19-C20-C21-A22

Disulfides: C3::C15, C4::C20, C10::C21



The connection from R19 to C20 could go over or under the
strand from Q14 to C15. One preferred form of variegation
is to vary the residues in one loop. Because the longest
loop contains only five amino acids, it is appropriate to
also vary the residues connected to the cysteines that form
the loop. For example, we might vary residues 5 through 9
plus 2, 11, 19, and 22. Another useful variegation would be
to vary residues 11-14 and 16-19, each through eight amino
acids. Concerning µ-conotoxins, see BECK89b, BECK89c,
CRUZ89, and HIDA90.

The ω-conotoxins may be represented as follows:

1         2         3 1'          2'          3'
C-(6 AAs)-C-(6 AAs)-C-C-(2-3 AAs)-C-(4-6 AAs)-C

The King Kong peptide has the same disulfide arrangement as
the ω-conotoxins but a different biological activity.
Woodward et al. (WOOD90) report the sequences of three
homologous proteins from C. textile. Within the mature
toxin domain, only the cysteines are conserved. The spacing
of the cysteines is exactly conserved, but no other position
has the same amino acid in all three sequences and only a
few positions show even pair-wise matches. Thus we conclude
that all positions (except the cysteines) may be substituted
freely with a high probability that a stable disulfide
structure will form. Concerning ω-conotoxins, see HILL89
and SUNX87.

Another micro-protein which may be used as a parental
binding domain is the Cucurbita maxima trypsin inhibitor I
(CMTI-I); CMTI-III is also appropriate. They are members of
the squash family of serine protease inhibitors, which also
includes inhibitors from summer squash, zucchini, and
cucumbers (WIEC85). McWherter et al. (MCWH89) describe
synthetic sequence-variants of the squash-seed protease
inhibitors that have affinity for human leukocyte elastase
and cathepsin G. Of course, any member of this family might
be used.

CMTI-I is one of the smallest proteins known, compris-
ing only 29 amino acids held in a fixed conformation by
three disulfide bonds. The structure has been studied by
Bode and colleagues using both X-ray diffraction (BODE89)
and NMR (HOLA89a,b). CMTI-I is of ellipsoidal shape; it
lacks helices or β-sheets, but consists of turns and
connecting short polypeptide stretches. The disulfide
pairing is Cys3-Cys20, Cys10-Cys22 and Cys16-Cys28. In the
CMTI-I:trypsin complex studied by Bode et al., 13 of the 29
inhibitor residues are in direct contact with trypsin; most
of them are in the primary binding segment Val2 (P4)-Glu9
(P4'), which contains the reactive site bond Arg5 (P1)-Ile6
and is in a conformation observed also for other serine
proteinase inhibitors.

CMTI-I has a Ki for trypsin of ~1.5 × 10⁻¹² M. McWherter
et al. suggested substitution of "moderately bulky hydropho-
bic groups" at P1 to confer HLE specificity. They found
that a wider set of residues (VAL, ILE, LEU, ALA, PHE, MET,
and GLY) gave detectable binding to HLE. For cathepsin G,
they expected bulky (especially aromatic) side groups to be
strongly preferred. They found that PHE, LEU, MET, and ALA
were functional by their criteria; they did not test TRP,
TYR, or HIS. (Note that ALA has the second smallest side
group available.)

A preferred initial variegation strategy would be to
vary some or all of the residues ARG1, VAL2, PRO4, ARG5, ILE6,
LEU7, MET8, GLU9, LYS11, HIS25, GLY26, TYR27, and GLY29. If the
target were HNE, for example, one could synthesize DNA
embodying the following possibilities:

            vg       Allowed           #AA seqs/
Parental    codon    amino acids       #DNA seqs

ARG1        VNT      RSLPHITNVADG      12/12
VAL2        NWT      VILFYHND           8/8
PRO4        VYT      PLTIAV             6/6
ARG5        VNT      RSLPHITNVADG      12/12
ILE6        NNK      all 20            20/31
LEU7        VWG      LQMKVE             6/6
TYR27       NAS      YHQNKDE.           7/8

This allows about 5.81 × 10⁶ amino-acid sequences encoded by
about 1.03 × 10⁷ DNA sequences. A library comprising 5.0 × 10⁷
independent transformants would give ~99% of the possible
sequences. Other variegation schemes could also be used.

Other inhibitors of this family include:

Trypsin inhibitor I from Citrullus vulgaris (OTLE87),
Trypsin inhibitor II from Bryonia dioica (OTLE87),
Trypsin inhibitor I from Cucurbita maxima (in OTLE87),
trypsin inhibitor III from Cucurbita maxima (in OTLE87),
trypsin inhibitor IV from Cucurbita maxima (in OTLE87),
trypsin inhibitor II from Cucurbita pepo (in OTLE87),
trypsin inhibitor III from Cucurbita pepo (in OTLE87),
trypsin inhibitor IIb from Cucumis sativus (in OTLE87),
trypsin inhibitor IV from Cucumis sativus (in OTLE87),
trypsin inhibitor II from Ecballium elaterium (FAVE89), and
inhibitor CM-1 from Momordica repens (in OTLE87).

Another class of micro-proteins that may be used as an
initial potential binding domain is the heat-stable
enterotoxins derived from some enterotoxigenic E. coli,
Citrobacter freundii, and other bacteria (GUAR89). These
micro-proteins are known to be secreted from E. coli and are
extremely stable. Works related to synthesis, cloning,
expression and properties of these proteins include: BHAT86,
SEKI85, SHIM87, TAKA85, TAKE90, THOM85a,b, YOSH85, DALL90,
DWAR89, GARI87, GUZM89, GUZM90, HOUG84, KUBO89, KUPE90,
OKAM87, OKAM88, and OKAM90.

EXAMPLE IV 

A MINI-PROTEIN HAVING A CROSS-LINK CONSISTING OF CU(II), ONE
CYSTEINE, TWO HISTIDINES, AND ONE METHIONINE

Sequences such as
HIS-ASN-GLY-MET-Xaa-Xaa-Xaa-Xaa-Xaa-Xaa-HIS-ASN-GLY-CYS and
CYS-ASN-GLY-MET-Xaa-Xaa-Xaa-Xaa-Xaa-Xaa-HIS-ASN-GLY-HIS are
likely to combine with Cu(II) to form structures as shown in
the diagram:

      Xaa7    Xaa8                     Xaa7    Xaa8
     /            \                   /            \
  Xaa6            Xaa9             Xaa6            Xaa9
   |                |               |                |
  Xaa5            Xaa10            Xaa5            Xaa10
     \            /                   \            /
      MET4    HIS11                    MET4    HIS11
     /    \  /    \                   /    \  /    \
  GLY3     Cu      ASN12           GLY3     Cu      ASN12
   |      /  \       |              |      /  \       |
  ASN2-HIS1  CYS14-GLY13           ASN2-CYS1  HIS14-GLY13
        |      |                         |      |
       NH3    COO                       NH3    COO

Other arrangements of HIS, MET, HIS, and CYS along the chain
are also likely to form similar structures. The amino acids
ASN-GLY at positions 2 and 3 and at positions 12 and 13 give
the amino acids that carry the metal-binding ligands enough
flexibility for them to come together and bind the metal.
Other connecting sequences may be used, e.g., GLY-ASN, SER-
GLY, GLY-PRO, GLY-PRO-GLY, or PRO-GLY-ASN. It
is also possible to vary one or more residues in the loops
that join the first and second or the third and fourth
metal-binding residues. For example,

          Xaa8    Xaa9
         /            \
      Xaa7            Xaa10
       |                |
      Xaa6            Xaa11
         \            /
          MET5    HIS12
         /    \  /    \
      Xaa4     Cu      ASN13
       |      /  \       |
      PRO3   /    \      |
        \   /      \     |
      GLY2-HIS1   CYS15-GLY14
            |       |
           NH3     COO

is likely to form the diagrammed structure for a wide
variety of amino acids at Xaa4. It is expected that the
side groups of Xaa4 and Xaa6 will be close together and on
the surface of the mini-protein.
The variable amino acids are held so that they have
limited flexibility. This cross-linkage has some differ-
ences from the disulfide linkage. The separation between
Cα4 and Cα11 is greater than the separation of the Cαs of a
cystine. In addition, the interaction of residues 1 through
4 and 11 through 14 with the metal ion is expected to limit
the motion of residues 5 through 10 more than a disulfide
between residues 4 and 11. A single disulfide bond exerts
strong distance constraints on the α carbons of the joined
residues, but very little directional constraint on, for
example, the vector from N to C in the main chain.

For the desired sequence, the side groups of residues
5 through 10 can form specific interactions with the target.
Other numbers of variable amino acids, for example, 4, 5, 7,
or 3, are appropriate. Larger spans may be used when the
enclosed sequence contains segments having a high potential
to form α helices or other secondary structure that limits
the conformational freedom of the polypeptide main chain.
Whereas a mini-protein having four CYSs could form three
distinct pairings, a mini-protein having two HISs, one MET,
and one CYS can form only two distinct complexes with Cu.
These two structures are related by mirror symmetry through
the Cu. Because the two HISs are distinguishable, the
structures are different.
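The statement that four cysteines can pair in three distinct ways can be verified by enumerating every way of splitting the cysteines into disulfide-bonded pairs. The sketch below is purely combinatorial and says nothing about which pairings are sterically feasible; the residue labels are placeholders.

```python
def pairings(residues):
    """Return every way to join all residues into unordered pairs."""
    if not residues:
        return [[]]
    first, rest = residues[0], residues[1:]
    result = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub in pairings(remaining):
            result.append([(first, partner)] + sub)
    return result


four_cys = ["Cys_a", "Cys_b", "Cys_c", "Cys_d"]  # placeholder labels
for p in pairings(four_cys):
    print(p)
print(len(pairings(four_cys)), "pairings for four cysteines")
print(len(pairings(list(range(6)))), "pairings for six cysteines")
```

The count grows as (n-1)(n-3)...1, so a six-cysteine peptide such as the three-disulfide conotoxins discussed in Example III has 15 possible pairings.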

When such metal-containing mini-proteins are displayed
on filamentous phage, the cells that produce the phage can
be grown in the presence of the appropriate metal ion, or
the phage can be exposed to the metal only after they are
separated from the cells.

EXAMPLE V

A MINI-PROTEIN HAVING A CROSS-LINK CONSISTING OF ZN(II) AND
FOUR CYSTEINES

A cross link similar to the one shown in Example IV is
exemplified by the zinc-finger proteins (GIBS88, GAUS87,
PARR88, FRAN87, CHOW87, HARD90). One family of zinc-fingers
has two CYS and two HIS residues in conserved positions that
bind Zn++ (PARR88, FRAN87, CHOW87, EVAN88, BERG88, CHAV88).
Gibson et al. (GIBS88) review a number of sequences thought
to form zinc-fingers and propose a three-dimensional model
for these compounds. Most of these sequences have two CYS
and two HIS residues in conserved positions, but some have
three CYS and one HIS residue. Gauss et al. (GAUS87) also
report a zinc-finger protein having three CYS and one HIS
residues that bind zinc. Hard et al. (HARD90) report the 3D
structure of a protein that comprises two zinc-fingers, each
of which has four CYS residues. All of these zinc-binding
proteins are stable in the reducing intracellular environ-
ment.

One preferred example of a CYS::zinc cross-linked mini-
protein comprises residues 440 to 461 of the sequence shown
in Figure 1 of HARD90. The residues 444 through 456 may be
variegated. One such variegation is as follows:

Parental   Allowed                                   #AA / #DNA

SER444     SER, ALA                                    2 / 2
ASP445     ASP, ASN, GLU, LYS                          4 / 4
GLU446     GLU, LYS, GLN                               3 / 3
ALA447     ALA, THR, GLY, SER                          4 / 4
SER448     SER, ALA                                    2 / 2
GLY449     GLY, SER, ASN, ASP                          4 / 4
CYS450     CYS, PHE, ARG, LEU                          4 / 4
HIS451     HIS, GLN, ASN, LYS, ASP, GLU                6 / 6
TYR452     TYR, PHE, HIS, LEU                          4 / 4
GLY453     GLY, SER, ASN, ASP                          4 / 4
VAL454     VAL, ALA, ASP, GLY, SER, ASN, THR, ILE      8 / 8
LEU455     LEU, HIS, ASP, VAL                          4 / 4
THR456     THR, ILE, ASN, SER                          4 / 4

This leads to 3.77 × 10⁷ DNA sequences that encode the same
number of amino-acid sequences. A library having 1.0 × 10⁸
independent transformants will display 93% of the allowed
sequences; 2.0 × 10⁸ independent transformants will display
99.5% of allowed sequences.
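The library-size figures for this variegation can be approached from the other direction: given a desired completeness, how many independent transformants are needed? The sketch below multiplies the per-position counts from the table and inverts the Poisson sampling relation f = 1 - exp(-N/M). The uniform-sampling assumption and the function name are illustrative and are not stated in the text.

```python
import math

# Allowed amino acids per position, residues 444 through 456 (from the table).
ALLOWED_PER_POSITION = [2, 4, 3, 4, 2, 4, 4, 6, 4, 4, 8, 4, 4]
n_sequences = math.prod(ALLOWED_PER_POSITION)


def transformants_needed(fraction, n_sequences):
    """Independent transformants needed to display `fraction` of sequences,
    assuming each transformant is a uniform random draw."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return -n_sequences * math.log(1.0 - fraction)


print(f"{n_sequences:,} sequences")
for f in (0.93, 0.995):
    needed = transformants_needed(f, n_sequences)
    print(f"{f:.1%} coverage needs about {needed:.2e} transformants")
```

Because the required number grows with -ln(1 - f), moving from 93% to 99.5% coverage roughly doubles the number of transformants, in line with the two library sizes quoted.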
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Table 2: Preferred Outer- Surface Proteins 



Genetic        Preferred outer-
package        surface protein    Reason for preference

M13            coat protein       a) exposed amino terminus,
               (gpVIII)           b) predictable post-translational
                                     processing,
                                  c) numerous copies in virion,
                                  d) fusion data available.

               gp III             a) fusion data available,
                                  b) amino terminus exposed,
                                  c) working example available.

PhiX174        G protein          a) known to be on virion exterior,
                                  b) small enough that the G-ipbd
                                     gene can replace H gene.

E. coli        LamB               a) fusion data available,
                                  b) non-essential.

               OmpC               a) topological model,
                                  b) non-essential; abundant.

               OmpA               a) topological model,
                                  b) non-essential; abundant,
                                  c) homologues in other genera.

               OmpF               a) topological model,
                                  b) non-essential; abundant.

               PhoE               a) topological model,
                                  b) non-essential; abundant,
                                  c) inducible.

B. subtilis    CotC               a) no post-translational
spores                               processing,
                                  b) distinctive sequence that
                                     causes protein to localize in
                                     spore coat,
                                  c) non-essential.

               CotD               Same as for CotC.
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Table 10: Abundances obtained 
from various vgCodons 



A. Optimized fxS Codon, Restrained by [D]+[E] = [K]+[R]

Nucleotide fractions at each codon position:

             T      C      A      G
     1      .26    .18    .26    .30
     2      .22    .16    .40    .22
     3      .5     .0     .0     .5

Amino                     Amino
acid     Abundance        acid     Abundance

 A        4.80%            C        2.86%
 D        6.00%            E        6.00%
 F        2.86%            G        6.60%
 H        3.60%            I        2.86%
 K        5.20%            L        6.82%
 M        2.86%            N        5.20%
 P        2.88%            Q        3.60%
 R        6.82%            S        7.02% mfaa
 T        4.16%            V        6.60%
 W        2.86% lfaa       Y        5.20%
stop      5.20%

[D] + [E] = [K] + [R] = .12
ratio = Abun(W)/Abun(S) = 0.4074

  j     (1/ratio)^j     (ratio)^j       stop-free
  1        2.454         .4074           .9480
  2        6.025         .1660           .8987
  3       14.788         .0676           .8520
  4       36.298         .0275           .8077
  5       89.095         .0112           .7657
  6      218.7           4.57 × 10⁻³     .7258
  7      536.8           1.86 × 10⁻³     .6881
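The abundances in part A follow from the nucleotide fractions by multiplying, for each of the 64 codons, the fractions of its three bases and summing over synonymous codons. The sketch below assumes that bases are incorporated independently at each position, which is the usual model for a mixed-base synthesis; the helper names are illustrative.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position; "*" = stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def abundances(mix):
    """mix: three dicts (one per codon position) mapping base -> fraction.
    Returns residue -> probability, with "*" standing for stop."""
    result = {}
    for b1, b2, b3 in product(BASES, repeat=3):
        p = mix[0][b1] * mix[1][b2] * mix[2][b3]
        residue = CODON_TABLE[b1 + b2 + b3]
        result[residue] = result.get(residue, 0.0) + p
    return result


# Nucleotide fractions of Table 10, part A.
TABLE_10A = [
    dict(T=0.26, C=0.18, A=0.26, G=0.30),
    dict(T=0.22, C=0.16, A=0.40, G=0.22),
    dict(T=0.50, C=0.00, A=0.00, G=0.50),
]

ab = abundances(TABLE_10A)
coding = {aa: p for aa, p in ab.items() if aa != "*"}
ratio = min(coding.values()) / max(coding.values())   # lfaa / mfaa
stop_free = 1.0 - ab["*"]

for j in range(1, 8):
    print(j, f"{(1 / ratio) ** j:9.3f}", f"{ratio ** j:.4g}", f"{stop_free ** j:.4f}")
```

The same function applied to the fractions of parts B through E should reproduce those abundance columns as well, since only the `mix` argument changes.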
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Table 10: Abundances obtained 
from various vgCodon 
(continued) 

B. Unrestrained, optimized

Nucleotide fractions at each codon position:

             T      C      A      G
     1      .27    .19    .27    .27
     2      .21    .15    .43    .21
     3      .5     .0     .0     .5

Amino                     Amino
acid     Abundance        acid     Abundance

 A        4.05%            C        2.84%
 D        5.81%            E        5.81%
 F        2.84%            G        5.67%
 H        4.08%            I        2.84%
 K        5.81%            L        6.83%
 M        2.84%            N        5.81%
 P        2.85%            Q        4.08%
 R        6.83%            S        6.89% mfaa
 T        4.05%            V        5.67%
 W        2.84% lfaa       Y        5.81%
stop      5.81%

[D] + [E] = 0.1162      [K] + [R] = 0.1264
ratio = Abun(W)/Abun(S) = 0.41176

  j     (1/ratio)^j     (ratio)^j       stop-free
  1        2.4286        .41176          .9419
  2        5.8981        .16955          .8872
  3       14.3241        .06981          .8356
  4       34.7875        .02875          .7871
  5       84.4849        .011836         .74135
  6      205.180         .004874         .69828
  7      498.3           2.007 × 10⁻³    .6577
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Table 10: Abundances obtained 
from various vgCodon 
(continued) 



C. Optimized NNT

Nucleotide fractions at each codon position:

             T        C        A        G
     1      .2071    .2929    .2071    .2929
     2      .2929    .2071    .2929    .2071
     3     1.        .0       .0       .0

Amino                     Amino
acid     Abundance        acid     Abundance

 A        6.06%            C        4.29%
 D        8.58%            E        none
 F        6.06%            G        6.06%
 H        8.58%            I        6.06%
 K        none             L        8.58%
 M        none             N        6.06%
 P        6.06%            Q        none
 R        6.06%            S        8.58% mfaa
 T        4.29% lfaa       V        8.58%
 W        none             Y        6.06%
stop      none

  j     (1/ratio)^j     (ratio)^j       stop-free
  1        2.0           .5              1.
  2        4.0           .25             1.
  3        8.0           .125            1.
  4       16.0           .0625           1.
  5       32.0           .03125          1.
  6       64.0           .015625         1.
  7      128.0           .0078125        1.
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Table 10: Abundances obtained 
from various vgCodon 
( continued) 



D. Optimized NNG

Nucleotide fractions at each codon position:

             T       C       A       G
     1      .23     .21     .23     .33
     2      .215    .285    .285    .215
     3      .0      .0      .0     1.0

Amino                     Amino
acid     Abundance        acid     Abundance

 A        9.40%            C        none
 D        none             E        9.40%
 F        none             G        7.10%
 H        none             I        none
 K        6.60%            L        9.50% mfaa
 M        4.90%            N        none
 P        6.00%            Q        6.00%
 R        9.50%            S        6.60%
 T        6.60%            V        7.10%
 W        4.90% lfaa       Y        none
stop      6.60%

  j     (1/ratio)^j     (ratio)^j       stop-free
  1        1.9388        .51579          0.934
  2        3.7588        .26604          0.8723
  3        7.2876        .13722          0.8148
  4       14.1289        .07078          0.7610
  5       27.3929        3.65 × 10⁻²     0.7108
  6       53.109         1.88 × 10⁻²     0.6639
  7      102.96          9.72 × 10⁻³     0.6200
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Table 10 : Abundances obtained 
from optimum vgCodon 
(continued) 



E. Unoptimized NNS (NNK gives identical distribution)

Nucleotide fractions at each codon position:

             T      C      A      G
     1      .25    .25    .25    .25
     2      .25    .25    .25    .25
     3      .0     .5     .0     .5

Amino                     Amino
acid     Abundance        acid     Abundance

 A        6.25%            C        3.125%
 D        3.125%           E        3.125%
 F        3.125%           G        6.25%
 H        3.125%           I        3.125%
 K        3.125%           L        9.375%
 M        3.125%           N        3.125%
 P        6.25%            Q        3.125%
 R        9.375%           S        9.375%
 T        6.25%            V        6.25%
 W        3.125%           Y        3.125%
stop      3.125%

  j     (1/ratio)^j     (ratio)^j       stop-free
  1        3.0           .33333          .96875
  2        9.0           .11111          .9385
  3       27.0           .03704          .90915
  4       81.0           .01234567       .8807
  5      243.0           .0041152        .8532
  6      729.0           1.37 × 10⁻³     .82655
  7     2187.0           4.57 × 10⁻⁴     .8007
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Table 130: Sampling of a Library encoded by (NNK) 6 
A. Numbers of hexapeptides in each class

total = 64,000,000 stop-free sequences.

α can be one of [WMFYCIKDENHQ]
Φ can be one of [PTAVG]
Ω can be one of [SLR]

Class       Number         Class       Number

αααααα      2985984.       Φααααα      7464960.
Ωααααα      4478976.       ΦΦαααα      7776000.
ΦΩαααα      9331200.       ΩΩαααα      2799360.
ΦΦΦααα      4320000.       ΦΦΩααα      7776000.
ΦΩΩααα      4665600.       ΩΩΩααα       933120.
ΦΦΦΦαα      1350000.       ΦΦΦΩαα      3240000.
ΦΦΩΩαα      2916000.       ΦΩΩΩαα      1166400.
ΩΩΩΩαα       174960.       ΦΦΦΦΦα       225000.
ΦΦΦΦΩα       675000.       ΦΦΦΩΩα       810000.
ΦΦΩΩΩα       486000.       ΦΩΩΩΩα       145800.
ΩΩΩΩΩα        17496.       ΦΦΦΦΦΦ        15625.
ΦΦΦΦΦΩ        56250.       ΦΦΦΦΩΩ        84375.
ΦΦΦΩΩΩ        67500.       ΦΦΩΩΩΩ        30375.
ΦΩΩΩΩΩ         7290.       ΩΩΩΩΩΩ          729.

ΦΦΩΩαα, for example, stands for the set of peptides having
two amino acids from the α class, two from Φ, and two from
Ω arranged in any order. There are, for example, 729 = 3⁶
sequences composed entirely of S, L, and R.
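The class sizes in part A are multinomial counts, and the class probabilities follow once each residue is weighted by its number of NNK codons. The sketch below assumes equimolar NNK, so that each α residue has one codon, each Φ residue two, and each Ω residue three, out of 31 stop-free codons. The function names are illustrative.

```python
from math import factorial

LENGTH = 6
STOP_FREE_CODONS = 31          # 32 NNK codons minus the single stop (TAG)
N_ALPHA, N_PHI, N_OMEGA = 12, 5, 3   # residues per class
K_ALPHA, K_PHI, K_OMEGA = 1, 2, 3    # NNK codons per residue in each class


def class_size(a, p, o):
    """Number of hexapeptides with a alpha, p phi and o omega residues."""
    arrangements = factorial(LENGTH) // (factorial(a) * factorial(p) * factorial(o))
    return arrangements * N_ALPHA**a * N_PHI**p * N_OMEGA**o


def class_probability(a, p, o):
    """Probability that a stop-free (NNK)6 DNA sequence encodes a member."""
    dna = class_size(a, p, o) * K_ALPHA**a * K_PHI**p * K_OMEGA**o
    return dna / STOP_FREE_CODONS**LENGTH


compositions = [(a, p, LENGTH - a - p)
                for a in range(LENGTH, -1, -1)
                for p in range(LENGTH - a, -1, -1)]

total = sum(class_size(*c) for c in compositions)
print(f"{len(compositions)} classes, {total:,} peptides in total")
for c in compositions[:3]:
    print(c, class_size(*c), f"{class_probability(*c):.3E}")
```

The 28 compositions account for all 20⁶ hexapeptides, and their probabilities sum to 1 because 12·1 + 5·2 + 3·3 = 31.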
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Table 130: Sampling of a Library encoded by (NNK) 6 

(continued) 

B. Probability that any given stop-free DNA
sequence will encode a hexapeptide from a
stated class.

Class          P            (% of class)

αααααα      3.364E-03       (1.13E-07)
Φααααα      1.682E-02       (2.25E-07)
Ωααααα      1.514E-02       (3.38E-07)
ΦΦαααα      3.505E-02       (4.51E-07)
ΦΩαααα      6.308E-02       (6.76E-07)
ΩΩαααα      2.839E-02       (1.01E-06)
ΦΦΦααα      3.894E-02       (9.01E-07)
ΦΦΩααα      1.051E-01       (1.35E-06)
ΦΩΩααα      9.463E-02       (2.03E-06)
ΩΩΩααα      2.839E-02       (3.04E-06)
ΦΦΦΦαα      2.434E-02       (1.80E-06)
ΦΦΦΩαα      8.762E-02       (2.70E-06)
ΦΦΩΩαα      1.183E-01       (4.06E-06)
ΦΩΩΩαα      7.097E-02       (6.08E-06)
ΩΩΩΩαα      1.597E-02       (9.13E-06)
ΦΦΦΦΦα      8.113E-03       (3.61E-06)
ΦΦΦΦΩα      3.651E-02       (5.41E-06)
ΦΦΦΩΩα      6.571E-02       (8.11E-06)
ΦΦΩΩΩα      5.914E-02       (1.22E-05)
ΦΩΩΩΩα      2.661E-02       (1.83E-05)
ΩΩΩΩΩα      4.790E-03       (2.74E-05)
ΦΦΦΦΦΦ      1.127E-03       (7.21E-06)
ΦΦΦΦΦΩ      6.084E-03       (1.08E-05)
ΦΦΦΦΩΩ      1.369E-02       (1.62E-05)
ΦΦΦΩΩΩ      1.643E-02       (2.43E-05)
ΦΦΩΩΩΩ      1.109E-02       (3.65E-05)
ΦΩΩΩΩΩ      3.992E-03       (5.48E-05)
ΩΩΩΩΩΩ      5.988E-04       (8.21E-05)
Table 130: Sampling of a Library encoded by (NNK)^6
(continued)

Number of different stop-free amino-acid
sequences in each class expected for various
library sizes

Library size = 1.0000E+06
total = 9.7446E+05    % sampled = 1.52

Class       Number  (    %)

αααααα      3362.6  (  0.1)
Φααααα     16803.4  (  0.2)
Ωααααα     15114.6  (  0.3)
ΦΦαααα     34967.8  (  0.4)
ΦΩαααα     62871.1  (  0.7)
ΩΩαααα     28244.3  (  1.0)
ΦΦΦααα     38765.7  (  0.9)
ΦΦΩααα    104432.2  (  1.3)
ΦΩΩααα     93672.7  (  2.0)
ΩΩΩααα     27960.3  (  3.0)
ΦΦΦΦαα     24119.9  (  1.8)
ΦΦΦΩαα     86442.5  (  2.7)
ΦΦΩΩαα    115915.5  (  4.0)
ΦΩΩΩαα     68853.5  (  5.9)
ΩΩΩΩαα     15261.1  (  8.7)
ΦΦΦΦΦα      7968.1  (  3.5)
ΦΦΦΦΩα     35537.2  (  5.3)
ΦΦΦΩΩα     63117.5  (  7.8)
ΦΦΩΩΩα     55684.4  ( 11.5)
ΦΩΩΩΩα     24325.9  ( 16.7)
ΩΩΩΩΩα      4190.6  ( 24.0)
ΦΦΦΦΦΦ      1087.1  (  7.0)
ΦΦΦΦΦΩ      5767.0  ( 10.3)
ΦΦΦΦΩΩ     12637.2  ( 15.0)
ΦΦΦΩΩΩ     14581.7  ( 21.6)
ΦΦΩΩΩΩ      9290.2  ( 30.6)
ΦΩΩΩΩΩ      3073.9  ( 42.2)
ΩΩΩΩΩΩ       408.4  ( 56.0)

Library size = 3.0000E+06
total = 2.7885E+06    % sampled = 4.36

Class       Number  (    %)

αααααα     10076.4  (  0.3)
Φααααα     50296.9  (  0.7)
Ωααααα     45190.9  (  1.0)
ΦΦαααα    104432.2  (  1.3)
ΦΩαααα    187345.5  (  2.0)
ΩΩαααα     83880.9  (  3.0)
ΦΦΦααα    115256.6  (  2.7)
ΦΦΩααα    309107.9  (  4.0)
ΦΩΩααα    275413.9  (  5.9)
ΩΩΩααα     81392.5  (  8.7)
ΦΦΦΦαα     71074.5  (  5.3)
ΦΦΦΩαα    252470.2  (  7.8)
ΦΦΩΩαα    334106.2  ( 11.5)
ΦΩΩΩαα    194606.9  ( 16.7)
ΩΩΩΩαα     41905.9  ( 24.0)
ΦΦΦΦΦα     23067.8  ( 10.3)
ΦΦΦΦΩα    101097.3  ( 15.0)
ΦΦΦΩΩα    174981.0  ( 21.6)
ΦΦΩΩΩα    148643.7  ( 30.6)
ΦΩΩΩΩα     61478.9  ( 42.2)
ΩΩΩΩΩα      9801.0  ( 56.0)
ΦΦΦΦΦΦ      3039.6  ( 19.5)
ΦΦΦΦΦΩ     15587.7  ( 27.7)
ΦΦΦΦΩΩ     32516.8  ( 38.5)
ΦΦΦΩΩΩ     34975.6  ( 51.8)
ΦΦΩΩΩΩ     20215.5  ( 66.6)
ΦΩΩΩΩΩ      5879.9  ( 80.7)
ΩΩΩΩΩΩ       667.0  ( 91.5)

Library size = 1.0000E+07
total = 8.1204E+06    % sampled = 12.69

Class       Number  (    %)

αααααα     33455.9  (  1.1)
Φααααα    166342.4  (  2.2)
Ωααααα    148871.1  (  3.3)
ΦΦαααα    342685.7  (  4.4)
ΦΩαααα    609987.6  (  6.5)
ΩΩαααα    269958.3  (  9.6)
ΦΦΦααα    372371.8  (  8.6)
ΦΦΩααα    983416.4  ( 12.6)
ΦΩΩααα    856471.6  ( 18.4)
ΩΩΩααα    244761.5  ( 26.2)
ΦΦΦΦαα    222702.0  ( 16.5)
ΦΦΦΩαα    767692.5  ( 23.7)
ΦΦΩΩαα    972324.6  ( 33.3)
ΦΩΩΩαα    531651.3  ( 45.6)
ΩΩΩΩαα    104722.3  ( 59.9)
ΦΦΦΦΦα     68111.0  ( 30.3)
ΦΦΦΦΩα    281976.3  ( 41.8)
ΦΦΦΩΩα    450120.2  ( 55.6)
ΦΦΩΩΩα    342072.1  ( 70.4)
ΦΩΩΩΩα    122302.6  ( 83.9)
ΩΩΩΩΩα     16364.0  ( 93.5)
ΦΦΦΦΦΦ      8028.0  ( 51.4)
ΦΦΦΦΦΩ     37179.9  ( 66.1)
ΦΦΦΦΩΩ     67719.5  ( 80.3)
ΦΦΦΩΩΩ     61580.0  ( 91.2)
ΦΦΩΩΩΩ      29586.  ( 97.4)
ΦΩΩΩΩΩ      7259.5  ( 99.6)
ΩΩΩΩΩΩ       728.8  (100.0)

Library size = 3.0000E+07
total = 1.8633E+07    % sampled = 29.11

Class       Number  (    %)

αααααα     99247.4  (  3.3)
Φααααα    487990.0  (  6.5)
Ωααααα    431933.3  (  9.6)
ΦΦαααα    983416.5  ( 12.6)
ΦΩαααα   1712943.0  ( 18.4)
ΩΩαααα    734284.6  ( 26.2)
ΦΦΦααα   1023590.0  ( 23.7)
ΦΦΩααα   2592866.0  ( 33.3)
ΦΩΩααα   2126605.0  ( 45.6)
ΩΩΩααα    558519.0  ( 59.9)
ΦΦΦΦαα    563952.6  ( 41.8)
ΦΦΦΩαα   1800481.0  ( 55.6)
ΦΦΩΩαα   2052433.0  ( 70.4)
ΦΩΩΩαα    978420.5  ( 83.9)
ΩΩΩΩαα    163640.3  ( 93.5)
ΦΦΦΦΦα    148719.7  ( 66.1)
ΦΦΦΦΩα    541755.7  ( 80.3)
ΦΦΦΩΩα    738960.1  ( 91.2)
ΦΦΩΩΩα    473377.0  ( 97.4)
ΦΩΩΩΩα    145189.7  ( 99.6)
ΩΩΩΩΩα     17491.3  (100.0)
ΦΦΦΦΦΦ     13829.1  ( 88.5)
ΦΦΦΦΦΩ     54058.1  ( 96.1)
ΦΦΦΦΩΩ     83726.0  ( 99.2)
ΦΦΦΩΩΩ     67454.5  ( 99.9)
ΦΦΩΩΩΩ     30374.5  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)

Library size = 7.6000E+07
total = 3.2125E+07    % sampled = 50.19

Class       Number  (    %)

αααααα    245057.8  (  8.2)
Φααααα   1175010.0  ( 15.7)
Ωααααα   1014733.0  ( 22.7)
ΦΦαααα   2255280.0  ( 29.0)
ΦΩαααα   3749112.0  ( 40.2)
ΩΩαααα   1504128.0  ( 53.7)
ΦΦΦααα   2142478.0  ( 49.6)
ΦΦΩααα   4993247.0  ( 64.2)
ΦΩΩααα   3666785.0  ( 78.6)
ΩΩΩααα    840691.9  ( 90.1)
ΦΦΦΦαα   1007002.0  ( 74.6)
ΦΦΦΩαα   2825063.0  ( 87.2)
ΦΦΩΩαα   2782358.0  ( 95.4)
ΦΩΩΩαα   1154956.0  ( 99.0)
ΩΩΩΩαα    174790.0  ( 99.9)
ΦΦΦΦΦα    210475.6  ( 93.5)
ΦΦΦΦΩα    663929.3  ( 98.4)
ΦΦΦΩΩα    808298.6  ( 99.8)
ΦΦΩΩΩα    485953.2  (100.0)
ΦΩΩΩΩα    145799.9  (100.0)
ΩΩΩΩΩα     17496.0  (100.0)
ΦΦΦΦΦΦ     15559.9  ( 99.6)
ΦΦΦΦΦΩ     56234.9  (100.0)
ΦΦΦΦΩΩ     84374.6  (100.0)
ΦΦΦΩΩΩ     67500.0  (100.0)
ΦΦΩΩΩΩ     30375.0  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)

Library size = 1.0000E+08
total = 3.6537E+07    % sampled = 57.09

Class       Number  (    %)

αααααα    318185.1  ( 10.7)
Φααααα   1506161.0  ( 20.2)
Ωααααα   1284677.0  ( 28.7)
ΦΦαααα   2821285.0  ( 36.3)
ΦΩαααα   4585163.0  ( 49.1)
ΩΩαααα   1783932.0  ( 63.7)
ΦΦΦααα   2566085.0  ( 59.4)
ΦΦΩααα   5764391.0  ( 74.1)
ΦΩΩααα   4051713.0  ( 86.8)
ΩΩΩααα    888584.3  ( 95.2)
ΦΦΦΦαα   1127473.0  ( 83.5)
ΦΦΦΩαα   3023170.0  ( 93.3)
ΦΦΩΩαα   2865517.0  ( 98.3)
ΦΩΩΩαα   1163743.0  ( 99.8)
ΩΩΩΩαα    174941.0  (100.0)
ΦΦΦΦΦα    218886.6  ( 97.3)
ΦΦΦΦΩα    671976.9  ( 99.6)
ΦΦΦΩΩα    809757.3  (100.0)
ΦΦΩΩΩα    485997.5  (100.0)
ΦΩΩΩΩα    145800.0  (100.0)
ΩΩΩΩΩα     17496.0  (100.0)
ΦΦΦΦΦΦ     15613.5  ( 99.9)
ΦΦΦΦΦΩ     56248.9  (100.0)
ΦΦΦΦΩΩ     84375.0  (100.0)
ΦΦΦΩΩΩ     67500.0  (100.0)
ΦΦΩΩΩΩ     30375.0  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)

Library size = 3.0000E+08
total = 5.2634E+07    % sampled = 82.24

Class       Number  (    %)

αααααα    856451.3  ( 28.7)
Φααααα   3668130.0  ( 49.1)
Ωααααα   2854291.0  ( 63.7)
ΦΦαααα   5764391.0  ( 74.1)
ΦΩαααα   8103426.0  ( 86.8)
ΩΩαααα   2665753.0  ( 95.2)
ΦΦΦααα   4030893.0  ( 93.3)
ΦΦΩααα   7641378.0  ( 98.3)
ΦΩΩααα   4654972.0  ( 99.8)
ΩΩΩααα    933018.6  (100.0)
ΦΦΦΦαα   1343954.0  ( 99.6)
ΦΦΦΩαα   3239029.0  (100.0)
ΦΦΩΩαα   2915985.0  (100.0)
ΦΩΩΩαα   1166400.0  (100.0)
ΩΩΩΩαα    174960.0  (100.0)
ΦΦΦΦΦα    224995.5  (100.0)
ΦΦΦΦΩα    674999.9  (100.0)
ΦΦΦΩΩα    810000.0  (100.0)
ΦΦΩΩΩα    486000.0  (100.0)
ΦΩΩΩΩα    145800.0  (100.0)
ΩΩΩΩΩα     17496.0  (100.0)
ΦΦΦΦΦΦ     15625.0  (100.0)
ΦΦΦΦΦΩ     56250.0  (100.0)
ΦΦΦΦΩΩ     84375.0  (100.0)
ΦΦΦΩΩΩ     67500.0  (100.0)
ΦΦΩΩΩΩ     30375.0  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)

Library size = 1.0000E+09
total = 6.1999E+07    % sampled = 96.87

Class       Number  (    %)

αααααα   2018278.0  ( 67.6)
Φααααα   6680917.0  ( 89.5)
Ωααααα   4326519.0  ( 96.6)
ΦΦαααα   7690221.0  ( 98.9)
ΦΩαααα   9320389.0  ( 99.9)
ΩΩαααα   2799250.0  (100.0)
ΦΦΦααα   4319475.0  (100.0)
ΦΦΩααα   7775990.0  (100.0)
ΦΩΩααα   4665600.0  (100.0)
ΩΩΩααα    933120.0  (100.0)
ΦΦΦΦαα   1350000.0  (100.0)
ΦΦΦΩαα   3240000.0  (100.0)
ΦΦΩΩαα   2916000.0  (100.0)
ΦΩΩΩαα   1166400.0  (100.0)
ΩΩΩΩαα    174960.0  (100.0)
ΦΦΦΦΦα    225000.0  (100.0)
ΦΦΦΦΩα    675000.0  (100.0)
ΦΦΦΩΩα    810000.0  (100.0)
ΦΦΩΩΩα    486000.0  (100.0)
ΦΩΩΩΩα    145800.0  (100.0)
ΩΩΩΩΩα     17496.0  (100.0)
ΦΦΦΦΦΦ     15625.0  (100.0)
ΦΦΦΦΦΩ     56250.0  (100.0)
ΦΦΦΦΩΩ     84375.0  (100.0)
ΦΦΦΩΩΩ     67500.0  (100.0)
ΦΦΩΩΩΩ     30375.0  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)

Library size = 3.0000E+09
total = 6.3890E+07    % sampled = 99.83

Class       Number  (    %)

αααααα   2884346.0  ( 96.6)
Φααααα   7456311.0  ( 99.9)
Ωααααα   4478800.0  (100.0)
ΦΦαααα   7775990.0  (100.0)
ΦΩαααα   9331200.0  (100.0)
ΩΩαααα   2799360.0  (100.0)
ΦΦΦααα   4320000.0  (100.0)
ΦΦΩααα   7776000.0  (100.0)
ΦΩΩααα   4665600.0  (100.0)
ΩΩΩααα    933120.0  (100.0)
ΦΦΦΦαα   1350000.0  (100.0)
ΦΦΦΩαα   3240000.0  (100.0)
ΦΦΩΩαα   2916000.0  (100.0)
ΦΩΩΩαα   1166400.0  (100.0)
ΩΩΩΩαα    174960.0  (100.0)
ΦΦΦΦΦα    225000.0  (100.0)
ΦΦΦΦΩα    675000.0  (100.0)
ΦΦΦΩΩα    810000.0  (100.0)
ΦΦΩΩΩα    486000.0  (100.0)
ΦΩΩΩΩα    145800.0  (100.0)
ΩΩΩΩΩα     17496.0  (100.0)
ΦΦΦΦΦΦ     15625.0  (100.0)
ΦΦΦΦΦΩ     56250.0  (100.0)
ΦΦΦΦΩΩ     84375.0  (100.0)
ΦΦΦΩΩΩ     67500.0  (100.0)
ΦΦΩΩΩΩ     30375.0  (100.0)
ΦΩΩΩΩΩ      7290.0  (100.0)
ΩΩΩΩΩΩ       729.0  (100.0)
Table 130, continued
D. Formulae for tabulated quantities.

Lsize is the number of independent transformants.
31**6 is 31 to the sixth power; 6*3 means 6 times 3.
A = Lsize/(31**6)
α can be one of [WMFYCIKDENHQ]
Φ can be one of [PTAVG]
Ω can be one of [SLR]

F0 = (12)**6    F1 = (12)**5    F2 = (12)**4
F3 = (12)**3    F4 = (12)**2    F5 = (12)
F6 = 1

αααααα = F0 * (1-exp(-A))
Φααααα = 6 * 5 * F1 * (1-exp(-2*A))
Ωααααα = 6 * 3 * F1 * (1-exp(-3*A))
ΦΦαααα = (15) * 5**2 * F2 * (1-exp(-4*A))
ΦΩαααα = (6*5)*5*3 * F2 * (1-exp(-6*A))
ΩΩαααα = (15) * 3**2 * F2 * (1-exp(-9*A))
ΦΦΦααα = (20) * (5**3) * F3 * (1-exp(-8*A))
ΦΦΩααα = (60) * (5*5*3) * F3 * (1-exp(-12*A))
ΦΩΩααα = (60) * (5*3*3) * F3 * (1-exp(-18*A))
ΩΩΩααα = (20) * (3)**3 * F3 * (1-exp(-27*A))
ΦΦΦΦαα = (15) * (5)**4 * F4 * (1-exp(-16*A))
ΦΦΦΩαα = (60) * (5)**3*3 * F4 * (1-exp(-24*A))
ΦΦΩΩαα = (90) * (5*5*3*3) * F4 * (1-exp(-36*A))
ΦΩΩΩαα = (60) * (5*3*3*3) * F4 * (1-exp(-54*A))
ΩΩΩΩαα = (15) * (3)**4 * F4 * (1-exp(-81*A))
ΦΦΦΦΦα = (6) * (5)**5 * F5 * (1-exp(-32*A))
ΦΦΦΦΩα = 30*5*5*5*5*3 * F5 * (1-exp(-48*A))
ΦΦΦΩΩα = 60*5*5*5*3*3 * F5 * (1-exp(-72*A))
ΦΦΩΩΩα = 60*5*5*3*3*3 * F5 * (1-exp(-108*A))
ΦΩΩΩΩα = 30*5*3*3*3*3 * F5 * (1-exp(-162*A))
ΩΩΩΩΩα = 6*3*3*3*3*3 * F5 * (1-exp(-243*A))
ΦΦΦΦΦΦ = 5**6 * (1-exp(-64*A))
ΦΦΦΦΦΩ = 6*3*5**5 * (1-exp(-96*A))
ΦΦΦΦΩΩ = 15*3*3*5**4 * (1-exp(-144*A))
ΦΦΦΩΩΩ = 20*3**3*5**3 * (1-exp(-216*A))
ΦΦΩΩΩΩ = 15*3**4*5**2 * (1-exp(-324*A))
ΦΩΩΩΩΩ = 6*3**5*5 * (1-exp(-486*A))
ΩΩΩΩΩΩ = 3**6 * (1-exp(-729*A))

total = αααααα + Φααααα + Ωααααα + ΦΦαααα +
        ΦΩαααα + ΩΩαααα + ΦΦΦααα + ΦΦΩααα +
        ΦΩΩααα + ΩΩΩααα + ΦΦΦΦαα + ΦΦΦΩαα +
        ΦΦΩΩαα + ΦΩΩΩαα + ΩΩΩΩαα + ΦΦΦΦΦα +
        ΦΦΦΦΩα + ΦΦΦΩΩα + ΦΦΩΩΩα + ΦΩΩΩΩα +
        ΩΩΩΩΩα + ΦΦΦΦΦΦ + ΦΦΦΦΦΩ + ΦΦΦΦΩΩ +
        ΦΦΦΩΩΩ + ΦΦΩΩΩΩ + ΦΩΩΩΩΩ + ΩΩΩΩΩΩ
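
The formulae of section D can be evaluated with a short script. The following is a minimal sketch, not the program used to prepare Table 130: it assumes only the class sizes (12, 5, 3), the codon multiplicities (1, 2, 3) and the definition A = Lsize/(31**6) given above, and the function name expected_by_class is hypothetical.

```python
from math import comb, exp

STOP_FREE_CODONS = 31  # NNK codons other than the single stop codon


def expected_by_class(lsize, n=6):
    """Expected number and percentage of distinct sequences per class.

    Keys are (number of alpha, number of Phi, number of Omega) positions.
    """
    a = lsize / STOP_FREE_CODONS ** n
    rows = {}
    for n_phi in range(n + 1):
        for n_omega in range(n + 1 - n_phi):
            n_alpha = n - n_phi - n_omega
            # Ways to place the Phi and Omega positions among n residues.
            arrangements = comb(n, n_phi) * comb(n - n_phi, n_omega)
            # Amino-acid sequences in the class: 12 alpha, 5 Phi, 3 Omega.
            n_seqs = arrangements * 12 ** n_alpha * 5 ** n_phi * 3 ** n_omega
            # DNA sequences encoding each amino-acid sequence of the class.
            dna_per_seq = 2 ** n_phi * 3 ** n_omega
            frac = 1 - exp(-dna_per_seq * a)
            rows[(n_alpha, n_phi, n_omega)] = (n_seqs * frac, 100 * frac)
    return rows


rows = expected_by_class(1.0e6)
total = sum(number for number, _ in rows.values())
print(f"total = {total:.4E}  % sampled = {100 * total / 20 ** 6:.2f}")
```

For Lsize = 1.0E+06 the total comes to about 9.74E+05, in line with the tabulated value.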
Table 131: Sampling of a Library
Encoded by (NNT)^4(NNG)^2

X can be F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G

Γ can be L^2,R^2,S,W,P,Q,M,T,K,V,A,E,G

Library comprises 8.55×10^6 amino-acid sequences; 1.47×10^7 DNA
sequences.

Total number of possible aa sequences = 8,555,625

x    LVPTARGFYCHIND
S    S
θ    VPTAGWQMKES
Ω    LR

The first, second, fifth, and sixth positions
can hold x or S; the third and fourth positions can hold θ or
Ω. I have lumped sequences by the number of xs, Ss, θs, and
Ωs.

For example xxθΩSS stands for:

[xxθΩSS, xSθΩxS, xSθΩSx, SSθΩxx, SxθΩxS, SxθΩSx,
 xxΩθSS, xSΩθxS, xSΩθSx, SSΩθxx, SxΩθxS, SxΩθSx]

The following table shows the likelihood that
any particular DNA sequence will fall into one of the
defined classes.

Library size = 1.0        Sampling = .00001%

total      1.0000E+00     %sampled   1.1688E-07

xxθθxx     3.1524E-01
xxθΩxx     2.2926E-01
xxΩΩxx     4.1684E-02
xxθθxS     1.8013E-01
xxθΩxS     1.3101E-01
xxΩΩxS     2.3819E-02
xxθθSS     3.8600E-02
xxθΩSS     2.8073E-02
xxΩΩSS     5.1042E-03
xSθθSS     3.6762E-03
xSθΩSS     2.6736E-03
xSΩΩSS     4.8611E-04
SSθθSS     1.3129E-04
SSθΩSS     9.5486E-05
SSΩΩSS     1.7361E-05

The following sections show how many sequences
of each class are expected for libraries of different sizes.

Library size = 1.0000E+05
total = 9.9137E+04    fraction sampled = 1.1587E-02

Type        Number  (    %)

xxθθxx     31416.9  (  0.7)
xxθΩxx     22771.4  (  1.3)
xxΩΩxx      4112.4  (  2.7)
xxθθxS     17891.8  (  1.3)
xxθΩxS     12924.6  (  2.7)
xxΩΩxS      2318.5  (  5.3)
xxθθSS      3808.1  (  2.7)
xxθΩSS      2732.5  (  5.3)
xxΩΩSS       483.7  ( 10.3)
xSθθSS       357.8  (  5.3)
xSθΩSS       253.4  ( 10.3)
xSΩΩSS        43.7  ( 19.5)
SSθθSS        12.4  ( 10.3)
SSθΩSS         8.6  ( 19.5)
SSΩΩSS         1.4  ( 35.2)

Library size = 1.0000E+06
total = 9.2064E+05    fraction sampled = 1.0761E-01

Type        Number  (    %)

xxθθxx    304783.9  (  6.6)
xxθΩxx    214394.0  ( 12.7)
xxΩΩxx     36508.6  ( 23.8)
xxθθxS    168452.5  ( 12.7)
xxθΩxS    114741.4  ( 23.8)
xxΩΩxS     18383.8  ( 41.9)
xxθθSS     33807.7  ( 23.8)
xxθΩSS     21666.6  ( 41.9)
xxΩΩSS      3114.6  ( 66.2)
xSθθSS      2837.3  ( 41.9)
xSθΩSS      1631.5  ( 66.2)
xSΩΩSS       198.4  ( 88.6)
SSθθSS        80.1  ( 66.2)
SSθΩSS        39.0  ( 88.6)
SSΩΩSS         3.9  ( 98.7)

Library size = 3.0000E+06
total = 2.3880E+06    fraction sampled = 2.7912E-01

Type        Number  (    %)

xxθθxx    855709.5  ( 18.4)
xxθΩxx    565051.6  ( 33.4)
xxΩΩxx     85564.7  ( 55.7)
xxθθxS    443969.1  ( 33.4)
xxθΩxS    268917.8  ( 55.7)
xxΩΩxS     35281.3  ( 80.4)
xxθθSS     79234.7  ( 55.7)
xxθΩSS     41581.5  ( 80.4)
xxΩΩSS      4522.6  ( 96.1)
xSθθSS      5445.2  ( 80.4)
xSθΩSS      2369.0  ( 96.1)
xSΩΩSS       223.7  ( 99.9)
SSθθSS       116.3  ( 96.1)
SSθΩSS        43.9  ( 99.9)
SSΩΩSS         4.0  (100.0)

Library size = 8.5556E+06
total = 4.9303E+06    fraction sampled = 5.7626E-01

Type        Number  (    %)

xxθθxx   2046301.0  ( 44.0)
xxθΩxx   1160645.0  ( 68.7)
xxΩΩxx    138575.9  ( 90.2)
xxθθxS    911935.6  ( 68.7)
xxθΩxS    435524.3  ( 90.2)
xxΩΩxS     43480.7  ( 99.0)
xxθθSS    128324.1  ( 90.2)
xxθΩSS     51245.1  ( 99.0)
xxΩΩSS      4703.6  (100.0)
xSθθSS      6710.7  ( 99.0)
xSθΩSS      2463.8  (100.0)
xSΩΩSS       224.0  (100.0)
SSθθSS       121.0  (100.0)
SSθΩSS        44.0  (100.0)
SSΩΩSS         4.0  (100.0)

Library size = 1.0000E+07
total = 5.3667E+06    fraction sampled = 6.2727E-01

Type        Number  (    %)

xxθθxx   2289093.0  ( 49.2)
xxθΩxx   1254877.0  ( 74.2)
xxΩΩxx    143467.0  ( 93.4)
xxθθxS    985974.9  ( 74.2)
xxθΩxS    450896.3  ( 93.4)
xxΩΩxS     43710.7  ( 99.6)
xxθθSS    132853.4  ( 93.4)
xxθΩSS     51516.1  ( 99.6)
xxΩΩSS      4703.9  (100.0)
xSθθSS      6746.2  ( 99.6)
xSθΩSS      2464.0  (100.0)
xSΩΩSS       224.0  (100.0)
SSθθSS       121.0  (100.0)
SSθΩSS        44.0  (100.0)
SSΩΩSS         4.0  (100.0)

Library size = 3.0000E+07
total = 7.8961E+06    fraction sampled = 9.2291E-01

Type        Number  (    %)

xxθθxx   4040589.0  ( 86.9)
xxθΩxx   1661409.0  ( 98.3)
xxΩΩxx    153619.1  (100.0)
xxθθxS   1305393.0  ( 98.3)
xxθΩxS    482802.9  (100.0)
xxΩΩxS     43904.0  (100.0)
xxθθSS    142254.4  (100.0)
xxθΩSS     51744.0  (100.0)
xxΩΩSS      4704.0  (100.0)
xSθθSS      6776.0  (100.0)
xSθΩSS      2464.0  (100.0)
xSΩΩSS       224.0  (100.0)
SSθθSS       121.0  (100.0)
SSθΩSS        44.0  (100.0)
SSΩΩSS         4.0  (100.0)

Library size = 5.0000E+07
total = 8.3956E+06    fraction sampled = 9.8130E-01

Type        Number  (    %)

xxθθxx   4491779.0  ( 96.6)
xxθΩxx   1688387.0  ( 99.9)
xxΩΩxx    153663.8  (100.0)
xxθθxS   1326590.0  ( 99.9)
xxθΩxS    482943.4  (100.0)
xxΩΩxS     43904.0  (100.0)
xxθθSS    142295.8  (100.0)
xxθΩSS     51744.0  (100.0)
xxΩΩSS      4704.0  (100.0)
xSθθSS      6776.0  (100.0)
xSθΩSS      2464.0  (100.0)
xSΩΩSS       224.0  (100.0)
SSθθSS       121.0  (100.0)
SSθΩSS        44.0  (100.0)
SSΩΩSS         4.0  (100.0)

Library size = 1.0000E+08
total = 8.5503E+06    fraction sampled = 9.9938E-01

Type        Number  (    %)

xxθθxx   4643063.0  ( 99.9)
xxθΩxx   1690302.0  (100.0)
xxΩΩxx    153664.0  (100.0)
xxθθxS   1328094.0  (100.0)
xxθΩxS    482944.0  (100.0)
xxΩΩxS     43904.0  (100.0)
xxθθSS    142296.0  (100.0)
xxθΩSS     51744.0  (100.0)
xxΩΩSS      4704.0  (100.0)
xSθθSS      6776.0  (100.0)
xSθΩSS      2464.0  (100.0)
xSΩΩSS       224.0  (100.0)
SSθθSS       121.0  (100.0)
SSθΩSS        44.0  (100.0)
SSΩΩSS         4.0  (100.0)
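
The entries of Table 131 appear to follow the same sampling model as Table 130, with the four NNT positions and the two NNG positions counted separately. The sketch below rests on that assumption; the class sizes and codon multiplicities are read off the key of Table 131, and the function name expected_by_class is hypothetical.

```python
from math import comb, exp

# (amino acids in class, codons per amino acid), read from the Table 131 key
X = (14, 1)      # NNT amino acids encoded by one codon
S = (1, 2)       # serine, encoded by two NNT codons
THETA = (11, 1)  # NNG amino acids encoded by one codon
OMEGA = (2, 2)   # L and R, encoded by two NNG codons each

DNA_TOTAL = 16 ** 4 * 15 ** 2  # stop-free DNA sequences (about 1.47E+07)
AA_TOTAL = 15 ** 4 * 13 ** 2   # amino-acid sequences (8,555,625)


def expected_by_class(lsize):
    """Expected distinct sequences per class, keyed by (number of S, number of Omega)."""
    a = lsize / DNA_TOTAL
    out = {}
    for n_s in range(5):        # S residues among the four NNT positions
        for n_om in range(3):   # Omega residues among the two NNG positions
            n_seqs = (comb(4, n_s) * X[0] ** (4 - n_s) * S[0] ** n_s
                      * comb(2, n_om) * THETA[0] ** (2 - n_om) * OMEGA[0] ** n_om)
            dna_per_seq = S[1] ** n_s * OMEGA[1] ** n_om
            out[(n_s, n_om)] = n_seqs * (1 - exp(-dna_per_seq * a))
    return out


rows = expected_by_class(1.0e6)
total = sum(rows.values())
print(f"total = {total:.4E}  fraction sampled = {total / AA_TOTAL:.4E}")
```

With Lsize = 1.0E+06 this gives a total close to the tabulated 9.2064E+05.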
Table 132: Relative efficiencies of
various simple variegation codons

                          Number of codons
                   5               6               7

vgCodon         #DNA/#AA        #DNA/#AA        #DNA/#AA
                [#DNA]          [#DNA]          [#DNA]
                (#AA)           (#AA)           (#AA)

NNK             8.95            13.86           21.49
assuming        [2.86×10^7]     [8.87×10^8]     [2.75×10^10]
stops vanish    (3.2×10^6)      (6.4×10^7)      (1.28×10^9)

NNT             1.38            1.47            1.57
                [1.05×10^6]     [1.68×10^7]     [2.68×10^8]
                (7.59×10^5)     (1.14×10^7)     (1.71×10^8)

NNG             2.04            2.36            2.72
assuming        [7.59×10^5]     [1.14×10^7]     [1.71×10^8]
stops vanish    (3.7×10^5)      (4.83×10^6)     (6.27×10^7)
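
The ratios in Table 132 can be checked by raising the per-codon counts to the number of variegated codons. This is a sketch for illustration only; the per-codon counts (stop-free codons, amino acids) are assumptions derived from the standard genetic code, and the names VG_CODONS and efficiency are hypothetical.

```python
# (stop-free codons, distinct amino acids) for each simple variegated codon
VG_CODONS = {"NNK": (31, 20), "NNT": (16, 15), "NNG": (15, 13)}


def efficiency(codons, aas, n):
    """Return (#DNA/#AA, #DNA, #AA) for n variegated codons."""
    n_dna = codons ** n
    n_aa = aas ** n
    return n_dna / n_aa, n_dna, n_aa


for name, (codons, aas) in VG_CODONS.items():
    for n in (5, 6, 7):
        ratio, n_dna, n_aa = efficiency(codons, aas, n)
        print(f"{name} n={n}: {ratio:.2f} [{n_dna:.2E}] ({n_aa:.2E})")
```

The computed ratios agree with the tabulated ones to within about 0.01; the lower the ratio, the fewer transformants are needed to sample the amino-acid sequences evenly.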
Table 155

Distance in Å between alpha carbons in octapeptides:

Extended Strand: angle of Cα-Cα-Cα = 138°

        1      2      3      4      5      6      7      8
1       -
2      3.8     -
3      7.1    3.8     -
4     10.7    7.1    3.8     -
5     14.2   10.7    7.1    3.8     -
6     17.7   14.1   10.7    7.1    3.8     -
7     21.2   17.7   14.1   10.6    7.0    3.8     -
8     24.6   20.9   17.5   13.9   10.6    7.0    3.8     -

Reverse turn between residues 4 and 5.

        1      2      3      4      5      6      7      8
1       -
2      3.8     -
3      7.1    3.8     -
4     10.6    7.0    3.8     -
5     11.6    8.0    6.1    3.8     -
6      9.0    5.8    5.5    5.6    3.8     -
7      6.2    4.1    6.3    8.0    7.0    3.8     -
8      5.8    6.0    9.1   11.6   10.7    7.2    3.8     -

Alpha helix: angle of Cα-Cα-Cα = 93°

        1      2      3      4      5      6      7      8
1       -
2      3.8     -
3      5.5    3.8     -
4      5.1    5.4    3.8     -
5      6.6    5.3    5.5    3.8     -
6      9.3    7.0    5.6    5.5    3.8     -
7     10.4    9.3    6.9    5.4    5.5    3.8     -
8     11.3   10.7    9.5    6.8    5.6    5.6    3.8     -
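
The separation of alpha carbons two residues apart follows from the 3.8 Å virtual bond between adjacent alpha carbons and the quoted Cα-Cα-Cα angle. The sketch below is an added worked example, not part of the original table; it treats three consecutive alpha carbons as an isosceles triangle, and the function name dist_i_to_i_plus_2 is hypothetical.

```python
from math import radians, sin

CA_CA = 3.8  # distance in Å between adjacent alpha carbons (from Table 155)


def dist_i_to_i_plus_2(angle_deg, bond=CA_CA):
    """Distance between alpha carbons i and i+2 for a given Cα-Cα-Cα angle."""
    # The base of an isosceles triangle with two sides of length `bond`
    # and apex angle `angle_deg`.
    return 2 * bond * sin(radians(angle_deg) / 2)


print(round(dist_i_to_i_plus_2(138), 1))  # extended strand
print(round(dist_i_to_i_plus_2(93), 1))   # alpha helix
```

These values reproduce the 7.1 Å and 5.5 Å entries for residues 1 and 3 in the extended-strand and alpha-helix matrices.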
Table 156

Distances between alpha carbons in closed mini-proteins of
the form disulfide cyclo(CXXXXC)

Minimum distance

        1      2      3      4      5      6
1       -
2      3.8     -
3      5.9    3.8     -
4      5.6    6.0    3.8     -
5      4.7    5.9    6.0    3.8     -
6      4.8    5.3    5.1    5.2    3.8     -

Average distance

        1      2      3      4      5      6
1       -
2      3.8     -
3      6.3    3.8     -
4      7.5    6.4    3.8     -
5      7.1    7.5    6.3    3.8     -
6      5.6    7.5    7.7    6.4    3.8     -

Maximum distance

        1      2      3      4      5      6
1       -
2      3.8     -
3      6.7    3.8     -
4      9.0    6.9    3.8     -
5      8.7    8.8    6.8    3.8     -
6      6.6    9.2    9.1    6.8    3.8     -
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Table 820: Peptide Phage 

Antibiotic 

Putative Streptavidin Resistance 
5 Name Binding Pep tide Sea. . ^ r}cer 

HPQ AEGPCHPQF- -CQSYIEGRIV- - - - E. . . 

DEV(F) A E - PCHPQYRLCQRPLKQPPPPPPAE... 

Dev(E) AE - LCHPQFPRCNLFRKVPPPPPPAE... 

10 HPQ6 AEGPCHPQFPRCYIEGRIV E... 

11111111112222222 
12345678901234567890123456 

C C = 

15 



WO 92/15677 



PCT/US92/01456 



123 

Table 838: Streptavidin-binding
disulfide-constrained peptides

clone  glu gly  X  cys  X   X   X   X  cys  X  ser  Frequency

#2     glu gly tyr cys his pro gln phe cys pro ser      4
#4     glu gly his cys his pro gln phe cys ser ser      3
#5     glu gly leu cys his pro gln phe cys gly ser      2
#8     glu gly asp cys his pro gln phe cys ser ser      2
#1     glu gly asn cys his pro gln phe cys pro ser      1
#3     glu gly asp cys his pro gln phe cys arg ser      1
#13    glu gly asp cys his pro gln phe cys val ser      1

                   cys his pro gln phe cys              consensus

Table 839: Sequences Obtained by
Enrichment over BSA

clone  glu gly  X  cys  X   X   X   X  cys  X  ser  Frequency

#21    glu gly gly cys phe lys arg asn cys tyr ser      1
#22    glu gly his cys asp lys lys ile cys leu ser      1
#23    glu gly phe cys his thr ala ala cys phe ser      1
#24    glu gly his cys tyr lys gly val cys ser ser      1
#25    glu gly his cys asp lys trp arg cys pro ser      1
#26    glu gly ile cys tyr arg leu asp cys ile ser      1
#27    glu gly gly cys phe pro trp his cys phe ser      1
#28    glu gly ser cys asp ser leu arg cys asp ser      1

No consensus observed.
CLAIMS

1. In a process for developing novel binding proteins
with a desired binding activity against a particular target
material comprising providing a population of genetic packages,
each displaying one or more copies of a particular potential
binding domain as part of a chimeric outer surface protein
thereof, said potential binding domain not being natively
associated with the outer surface of said package, said
population collectively displaying a plurality of different
potential binding domains, the differentiation among said
plurality of different potential binding domains occurring
through the at least partially random variation of one or more
predetermined amino acid positions, but not all amino acid
positions, of said parental binding domain to randomly obtain
at each said variable position an amino acid belonging to a
predetermined set of two or more amino acids, the amino acids
of said set occurring at said position in predetermined
expected proportions; contacting the packages with the target
material; and separating the packages according to their
affinity for said target material;

the improvement comprising essentially each said
potential binding domain being a mini-protein sequence of less
than forty amino acids and having at least one intrachain
covalent crosslink between at least a first amino acid position
and a second amino acid position thereof, the amino acids at
said first and second positions being invariant in all of the
chimeric proteins displayed by said population, with those
residues which participate in the formation of a covalent
crosslink being invariant throughout said population, with the
proviso that when the crosslink is in the form of a disulfide
bond, the potential binding domain is a micro-protein sequence
of less than forty amino acids.

2. The method of claim 1 wherein the crosslink is a
disulfide bond and the amino acids at the first and second
amino acid positions are cysteines.

3. The method of claim 2 in which the micro-protein
domain has a single disulfide bond and the span of the bond is
not more than nine amino acid residues.

4. The method of claim 2 in which the micro-protein
domain has a single disulfide bond, wherein the disulfide bond
bridges a sequence of amino acids which under affinity
separation conditions collectively assume a hairpin
supersecondary structure.

5. The method of claim 4 wherein the hairpin
secondary structure is selected from the group consisting of
(a) an α helix, a turn, and a β strand; (b) an α helix, a turn,
and an α helix; and (c) a β strand, a turn, and a β strand.

6. The method of claim 2 wherein the micro-protein
domain comprises two intrachain disulfide bonds and preferably
includes two clustered cysteines.

7. The method of claim 6 wherein the micro-protein
domain has two disulfide bonds having a connectivity pattern of
1-3, 2-4.

8. The method of claim 2 wherein the micro-protein
domain comprises three intrachain disulfide bonds and
preferably includes two clustered cysteines.

9. The method of claim 8 wherein the micro-protein
domain has three disulfide bonds having a connectivity pattern
of 1-4, 2-5, 3-6.

10. The method of claim 7 wherein the micro-protein
domain substantially corresponds in sequence to an α-conotoxin.

11. The method of claim 9 wherein the micro-protein
domain substantially corresponds in sequence to a mu- or omega-
conotoxin.

12. The method of claim 6 wherein the micro-protein
domain substantially corresponds in sequence to a micro-protein
selected from the group consisting of Escherichia coli heat
stable toxin I (STA), the bee venom apamin, or a squash-seed
trypsin inhibitor, the scorpion toxin, charybdotoxin and
secretory leukocyte protease inhibitor.

13. The method of claim 1 wherein the covalent
crosslink includes a metal atom, such as zinc, iron, copper or
cobalt.

14. The method of any of claims 1-13 wherein at least
one variable amino acid position in said potential binding
domains was encoded by a simply variegated codon selected from
the group consisting of NNT, NNG, RNG, RMG, VNT, RRS, and SNT.

15. The method of any of claims 1-13 wherein none of
the variable amino acid positions in said potential binding
domain was encoded by a simply variegated codon selected from
the group consisting of NNN, NNK and NNS.

16. The method of any of claims 1-13 wherein at least
one variable amino acid position in said potential binding
domains was encoded by a complexly variegated codon.

17. The method of any of claims 1-16 wherein the
replicable genetic package is a phage, preferably a DNA phage
other than phage lambda, more preferably a filamentous phage.

18. The method of claim 17 wherein the potential
binding domain is fused with the major coat protein of a
filamentous phage or an assemblable fragment thereof, or with
the gene III protein of a filamentous phage or an assemblable
fragment thereof.

19. The method of any of claims 1-16 wherein the
replicable genetic package is a bacterial cell, such as strains
of Escherichia coli, Salmonella typhimurium, Pseudomonas
aeruginosa, Klebsiella pneumoniae, Neisseria gonorrhoeae, or
Bacillus subtilis, said DNA construct further comprises a
periplasmic secretion signal sequence, and the potential
binding domain is fused with a bacterial outer surface protein
such as the lamB protein, OmpA, OmpC, OmpF, Phospholipase A, or
pilin, or an assemblable segment thereof.

20. The method of any of claims 1-19 wherein said
population is characterized by the display of at least 10^5
different potential binding domains, and wherein, for any
potentially encoded potential binding domain, the probability
that it will be displayed by at least one package in said
population is at least 50%, more preferably at least 90%.

21. A library of display phage or cells, each
displaying one or more copies of a particular potential binding
domain as part of a chimeric outer surface protein thereof,
said potential binding domain not being natively associated
with the outer surface of said phage or cells, said library
collectively displaying a plurality of different potential
binding domains, the differentiation among said plurality of
different potential binding domains occurring through the at
least partially random variation of one or more predetermined
amino acid positions, but not all amino acid positions, of said
parental binding domain to randomly obtain at each said
variable position an amino acid belonging to a predetermined
set of two or more amino acids, the amino acids of said set
occurring at said position in predetermined expected
proportions,

essentially each said potential binding domain being a mini-protein
sequence of less than sixty amino acids and having at
least one intrachain covalent crosslink between at least a
first amino acid position and a second amino acid position
thereof, the amino acids at said first and second positions
being invariant in all of the chimeric proteins displayed by
said population, with those residues which participate in the
formation of a covalent crosslink being invariant throughout
said population, with the proviso that when the crosslink is a
disulfide bond, the potential binding domain is a micro-protein
of less than 40 residues.